Scraping
General
- https://en.wikipedia.org/wiki/Data_scraping - a technique where a computer program extracts data from human-readable output coming from another program. Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity. Very often, these transmissions are not human-readable at all. Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
- https://en.wikipedia.org/wiki/Web_scraping - web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
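As a minimal sketch of the idea described above, the following Python uses only the standard library to pull link targets and visible text out of human-oriented HTML while ignoring display-only content such as styles and scripts. The sample markup, the class name `LinkAndTextScraper`, and the helper `scrape` are hypothetical names chosen for illustration; a real scraper would first fetch the page over HTTP.

```python
from html.parser import HTMLParser


class LinkAndTextScraper(HTMLParser):
    """Collect href targets and visible text, skipping script/style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags, in document order
        self.text_parts = []   # stripped, non-empty visible text fragments
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a reader would actually see on the page.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def scrape(html):
    """Return (links, visible_text) for an HTML string."""
    parser = LinkAndTextScraper()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical sample page, standing in for a fetched document.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
  <p>Price: <b>42</b> EUR</p>
  <a href="/next">Next page</a>
  <script>var tracking = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    links, text = scrape(SAMPLE_HTML)
    print(links)
    print(text)
```

The parser treats the page as a stream of tags and text rather than as a documented data format, which is what separates scraping from ordinary parsing: the extraction rules have to be inferred from how the page happens to be laid out.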
to sort
- http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mechanize.pm
- https://pypi.python.org/pypi/mechanize/
- https://code.google.com/archive/p/flying-saucer/ - render html to pdf
- https://github.com/Y2Z/monolith - Save HTML pages with ease
- ScraperWiki - Accurately extract tables from web pages and PDFs
- Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
- https://github.com/scrapinghub/portia - a tool for visually scraping web sites using Scrapy without any programming knowledge. Just annotate web pages with a point-and-click editor to indicate what data you want to extract, and Portia will learn how to scrape similar pages from the site.
- https://github.com/cantino/huginn - like Yahoo Pipes
- https://github.com/DormyMo/SpiderKeeper - admin UI for Scrapy; an open source alternative to Scrapinghub
- trafilatura 1.5.0 documentation - a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast; it runs in production on millions of documents. This tool can be useful for quantitative research in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
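The trade-off between limiting noise and keeping all valid parts can be illustrated with a toy heuristic. This is not trafilatura's algorithm or API; it is a standard-library sketch that drops text inside elements that commonly hold recurring page furniture and keeps the remaining paragraph text. The tag set `BOILERPLATE_TAGS`, the class `MainTextExtractor`, the helper `extract_main_text`, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of elements treated as recurring noise; widening it raises
# precision but risks losing valid content (lower recall).
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Keep the text of <p> elements that sit outside boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._boilerplate_depth = 0
        self._in_paragraph = False
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._boilerplate_depth += 1
        elif tag == "p" and self._boilerplate_depth == 0:
            self._in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._boilerplate_depth > 0:
            self._boilerplate_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Join fragments split by inline tags and normalise whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._boilerplate_depth == 0:
            self._current.append(data)


def extract_main_text(html):
    """Return the list of main-content paragraphs found in an HTML string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page with header, navigation and footer around the article.
PAGE = """
<body>
  <header><p>Site name</p></header>
  <nav><a href="/">Home</a></nav>
  <article>
    <p>First <em>real</em> paragraph.</p>
    <p>Second paragraph.</p>
  </article>
  <footer><p>Copyright notice</p></footer>
</body>
"""

if __name__ == "__main__":
    for paragraph in extract_main_text(PAGE):
        print(paragraph)
```

A fixed tag list like this breaks on pages that do not use semantic elements, which is one reason dedicated extractors combine several signals instead of relying on a single rule.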
Kiwix
- Kiwix - Wherever you go, you can browse Wikipedia, read books from the Gutenberg Library, or watch TED talks and much more, even if you don't have an Internet connection. Make highly compressed copies of entire websites that each fit into a single (.zim) file. Zim files are small enough that they can be stored on users' mobile phones, computers or a small, inexpensive Hotspot. Kiwix then acts like a regular browser, except that it reads these local copies. People with no or limited internet access can enjoy the same browsing experience as anyone else. The software as well as the content are fully open-source and free to use and share.