Scraping

From Things and Stuff Wiki




General

https://en.wikipedia.org/wiki/Digital_preservation - a formal process to ensure that digital information of continuing value remains accessible and usable in the long term. It involves planning, resource allocation, and application of preservation methods and technologies, and combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

https://en.wikipedia.org/wiki/Web_archiving - the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web. The largest web archiving organization based on a bulk crawling approach is the Wayback Machine, which strives to maintain an archive of the entire Web. The growing portion of human culture created and recorded on the web makes it inevitable that more and more libraries and archives will have to face the challenges of web archiving. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content.

https://en.wikipedia.org/wiki/Data_scraping - a technique where a computer program extracts data from human-readable output coming from another program. Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all. Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
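As a rough illustration of the distinction drawn above, the sketch below pulls structured rows out of a report that was formatted for a person to read rather than for a program. The report text, the `ROW` pattern and the `scrape_report` helper are all hypothetical, made up for this example; real display output usually needs a more forgiving pattern.

```python
import re

# Hypothetical human-readable report, as a program might print it for a user.
REPORT = """\
Disk usage report (generated for display)
-----------------------------------------
Filesystem      Size  Used  Use%
/dev/sda1        50G   20G   40%
/dev/sdb1       200G  150G   75%
Total: 2 filesystems
"""

# Only data rows match; titles, rulers, headers and summaries are ignored.
ROW = re.compile(
    r"^(?P<fs>/\S+)\s+(?P<size>\d+)G\s+(?P<used>\d+)G\s+(?P<pct>\d+)%$"
)


def scrape_report(text):
    """Return one dict per data row found in the display-oriented text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(
                {
                    "filesystem": match["fs"],
                    "size_gb": int(match["size"]),
                    "used_gb": int(match["used"]),
                    "percent": int(match["pct"]),
                }
            )
    return rows


if __name__ == "__main__":
    for row in scrape_report(REPORT):
        print(row)
```

The scraper depends entirely on the layout staying the same, which is the usual weakness of scraping compared with a documented interchange format.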

https://en.wikipedia.org/wiki/Web_scraping - web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
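A minimal sketch of the extraction step, using only the Python standard library: it collects link targets and link text from an HTML string. The `LinkExtractor` class, the `extract_links` helper and the sample markup are assumptions for illustration; in practice the HTML would first be fetched over HTTP (for example with `urllib.request.urlopen`), which is left out here so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, link_text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content standing in for a fetched document.
SAMPLE = """<html><body>
<h1>Tools</h1>
<a href="/wiki/Scrapy">Scrapy</a>
<a href="https://example.org/pup">pup <b>CLI</b></a>
<a name="anchor">no href</a>
</body></html>"""

if __name__ == "__main__":
    for url, text in extract_links(SAMPLE, "https://example.org/index.html"):
        print(url, "-", text)
```

The resulting pairs could then be written to a local database or spreadsheet, as described above.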

https://en.wikipedia.org/wiki/Comparison_of_software_saving_Web_pages_for_offline_use - A number of proprietary software products are available for saving Web pages for later use offline. They vary in terms of the techniques used for saving, what types of content can be saved, the format and compression of the saved files, provision for working with already saved content, and in other ways.

Status / Changes

changedetection.io - free, open source website change detection, restock monitoring and notification service. Designed for simplicity: it monitors which web pages have had a text change, and also covers website defacement monitoring, price change and price drop notifications, and product restock detection.
- https://github.com/dgtlmoon/changedetection.io
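The basic idea behind text change detection can be sketched as follows: normalise the page text, compare fingerprints, and produce a diff only when they differ. This is a simplified illustration under stated assumptions, not how changedetection.io is implemented; the function names and sample snapshots are hypothetical.

```python
import difflib
import hashlib


def normalise(text):
    """Collapse whitespace and drop blank lines so layout noise is ignored."""
    return "\n".join(
        " ".join(line.split()) for line in text.splitlines() if line.strip()
    )


def fingerprint(text):
    """Stable hash of the normalised text, cheap to store between checks."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def detect_change(old_text, new_text):
    """Return unified diff lines, or an empty list if nothing changed."""
    if fingerprint(old_text) == fingerprint(new_text):
        return []
    return list(
        difflib.unified_diff(
            normalise(old_text).splitlines(),
            normalise(new_text).splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical snapshots of the same page taken at different times.
OLD = "Widget\n  Price: 10.00  \nOut of stock\n"
SAME = "Widget\nPrice:   10.00\n\nOut of stock"
NEW = "Widget\nPrice: 8.50\nIn stock\n"

if __name__ == "__main__":
    print(detect_change(OLD, SAME))  # whitespace-only differences are ignored
    print("\n".join(detect_change(OLD, NEW)))
```

Storing only the fingerprint keeps each check cheap; the full previous text is needed only if a diff should be shown in the notification.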

to sort

http://www.httrack.com/page/2/

http://jakeaustwick.me/python-web-scraping-resource/

https://news.ycombinator.com/item?id=8192287

https://github.com/chfoo/wpull

https://github.com/jmcarp/robobrowser

https://code.google.com/archive/p/flying-saucer/ - render html to pdf

https://github.com/ericchiang/pup

https://github.com/Y2Z/monolith - Save HTML pages with ease

ScraperWiki - Accurately extract tables from web pages and PDFs
- https://github.com/scraperwiki/scraperwiki-python

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

https://github.com/scrapinghub/portia - a tool for visually scraping web sites using Scrapy without any programming knowledge. Just annotate web pages with a point and click editor to indicate what data you want to extract, and portia will learn how to scrape similar pages from the site.
- http://blog.scrapinghub.com/2014/04/01/announcing-portia/

kimono - Turn websites into structured APIs from your browser in seconds

https://github.com/ageitgey/node-unfluff

https://www.parsehub.com

https://news.ycombinator.com/item?id=8417061

http://cloudscrape.com/

https://github.com/cantino/huginn - like yahoo pipes

https://github.com/scrapinghub

https://github.com/DormyMo/SpiderKeeper - admin ui for scrapy/open source scrapinghub

https://github.com/daijro/hrequests - human requests, a simple, configurable, feature-rich replacement for the Python requests library.

trafilatura - a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast; it runs in production on millions of documents. This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
- https://github.com/adbar/trafilatura
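To give a feel for what "avoiding the noise caused by recurring elements" means, here is a deliberately naive standard-library sketch that keeps page text while skipping anything inside typical boilerplate containers. It is not trafilatura's algorithm or API, which is far more robust; the `NOISE_TAGS` set, the `MainTextExtractor` class and the sample page are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Assumed set of containers that usually hold recurring, non-article content.
NOISE_TAGS = {"head", "header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Keep text that is not nested inside any of the NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # how many noise containers we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page with navigation and footer around the article body.
PAGE = """<html><head><title>Post</title><style>p{color:red}</style></head><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Scraping notes</h1><p>First   paragraph.</p><p>Second paragraph.</p></article>
<footer>Copyright notice</footer>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(PAGE))
```

A tag blocklist like this favours recall over precision: pages that put boilerplate in plain `div` elements would leak noise, which is the kind of trade-off the description above refers to.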

https://github.com/harvard-lil/warc-diff-tools - Comparing warc files

My Web Archive Museum 🏛️ | Demo - DEMO: Safely and efficiently embedding web archives.
- https://github.com/harvard-lil/wacz-exhibitor - Experimental proxy and wrapper for safely embedding Web Archives (warc, warc.gz, wacz) into web pages.

ReplayWeb.page - a browser-based viewer that loads web archive files provided by the user and renders them for replay in the browser.
- https://github.com/webrecorder/wabac.js - wabac.js - Web Archive Browsing Augmentation Client
- https://github.com/webrecorder/replayweb.page
Welcome | ReplayWeb.Page - instructional guide on how to use Webrecorder’s replayweb.page for “replaying” your web archives in WARC, CDX, and WACZ formats (you can check out our supported formats for more information).

Save Your Threads - High-fidelity capture of Twitter threads as sealed PDFs.
- https://github.com/harvard-lil/thread-keeper

scrAPIr

scrAPIr - lets you fetch data through web APIs. You can: Immediately query many already-integrated web APIs. Publish and access shared queries and data sets. Easily add new APIs you need by filling out a web form.
- https://github.com/tarfahalrashed/ScrAPIr

Kiwix

Kiwix - Wherever you go, you can browse Wikipedia, read books from the Gutenberg Library, or watch TED talks and much more – even if you don’t have an Internet connection. Make highly compressed copies of entire websites that each fit into a single (.zim) file. Zim files are small enough that they can be stored on users’ mobile phones, computers or a small, inexpensive Hotspot. Kiwix then acts like a regular browser, except that it reads these local copies. People with no or limited internet access can enjoy the same browsing experience as anyone else. The software as well as the content are fully open-source and free to use and share.
- https://github.com/kiwix

Kiwix offline – Apps on Google Play
- https://github.com/kiwix/kiwix-android

Kiwix on the App Store
- https://github.com/kiwix/apple

onthespot

https://github.com/casualsnek/onthespot - qt based music downloader written in python

Archiving

https://github.com/iipc/awesome-web-archiving - An Awesome List for getting started with web archiving

ArchiveBox - a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline. You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows (WSL/Docker). You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list. The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.
- https://github.com/ArchiveBox/ArchiveBox

Retrieved from "https://wiki.thingsandstuff.org/index.php?title=Scraping&oldid=64175"