- 1 General
- 2 Local files
- 3 Systems
- 4 Services
- 5 robots.txt
- 6 Alerts
- 7 SEM/SEO
- 8 Distributed
- 9 Other
See also Organising
- Surfraw - provides a fast unix command line interface to a variety of popular WWW search engines and other artifacts of power. It reclaims google, altavista, babelfish, dejanews, freshmeat, research index, slashdot and many others from the false-prophet, pox-infested heathen lands of html-forms, placing these wonders where they belong, deep in unix heartland, as god loving extensions to the shell.
to fill out
- Recoll - a desktop full-text search tool. Recoll finds documents based on their contents as well as their file names. Versions are available for Linux, MS Windows and Mac OS X. A web front-end with preview and download features can replace or supplement the GUI for remote use. It can search most document formats, though you may need external applications for text extraction. It can reach any storage place: files, archive members and email attachments, transparently handling decompression. One click will open the document inside a native editor or display an even quicker text preview. The software is free, open source, and licensed under the GPL. See the project documentation for detailed features and the application requirements for supported document types.
- https://github.com/andersju/zzzfoo - This script lets you combine the excellent full-text search tool Recoll with Rofi (popular dmenu replacement, among other things) to quickly search all your indexed files. It simply performs a Recoll search using Recoll's Python module, pipes the output to rofi -dmenu, and (optionally) does something with the selected match. If you only need file paths in your results, forget this script and just grep/sed the result of recoll -t -e -b <foobar> and pipe that into rofi or dmenu; this script is basically a one-liner that got out of hand. However, if you want titles and MIME types and highlighted extracts and colors, as well as various options, keep reading.
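The simpler pipeline mentioned in the zzzfoo entry (Recoll results fed straight into rofi) might look roughly like this in Python. This is a minimal sketch, assuming `recoll` and `rofi` are on the PATH and that `recoll -t -e -b` prints one `file://` URL per line, possibly percent-encoded. The helper names `urls_to_paths` and `pick_result` are hypothetical and not part of either tool.

```python
import subprocess
from urllib.parse import unquote, urlparse


def urls_to_paths(lines):
    """Turn file:// URL lines into plain paths, skipping anything else."""
    paths = []
    for line in lines:
        line = line.strip()
        if line.startswith("file://"):
            # Assumption: paths may be percent-encoded, so decode them.
            paths.append(unquote(urlparse(line).path))
    return paths


def pick_result(query):
    """Run a Recoll query and let the user pick one match in rofi."""
    found = subprocess.run(
        ["recoll", "-t", "-e", "-b", query],
        capture_output=True, text=True, check=True,
    )
    paths = urls_to_paths(found.stdout.splitlines())
    chosen = subprocess.run(
        ["rofi", "-dmenu"],
        input="\n".join(paths), capture_output=True, text=True,
    )
    # rofi prints the selected line; an empty result means nothing was picked.
    return chosen.stdout.strip() or None
```

Calling `pick_result("foobar")` would return the chosen path or `None`. Titles, MIME types and highlighted extracts are what zzzfoo itself adds on top of this.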
- Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
Update: running Solr under Jetty is much easier.
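As a rough illustration of Solr's HTTP/JSON API, the sketch below builds a request against the standard `select` handler and reads the matching documents. It assumes a Solr instance at a base URL such as `http://localhost:8983` and a core named `notes`; both are hypothetical, as are the function names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, core, query, rows=10):
    """Build a select-handler URL that asks for JSON results."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


def solr_search(base_url, core, query, rows=10):
    """Fetch matching documents; needs a running Solr instance."""
    with urlopen(solr_select_url(base_url, core, query, rows)) as response:
        return json.load(response)["response"]["docs"]
```

Only `solr_select_url` runs without a server. `solr_search("http://localhost:8983", "notes", "title:search")` would return a list of document dicts if such a core exists.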
- Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must-haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
- Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems. Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.
- Techu exposes a RESTful API for realtime indexing and searching with the Sphinx full-text search engine. We leverage Redis, Nginx and the Python Django framework to make searching easy to handle & flexible.
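To give a feel for the Elasticsearch query DSL mentioned in the entry above, here is a minimal sketch that builds the JSON body for a full-text `match` query. The field name and search text are made up for illustration, and the body would typically be sent to an index's `_search` endpoint.

```python
import json


def match_query(field, text, size=10):
    """Return a request body for a full-text match query on one field."""
    return {"size": size, "query": {"match": {field: text}}}


# Serialise the body as it would be sent in an HTTP request.
body = json.dumps(match_query("title", "distributed search"))
```

The same dictionary shape can be passed to a client library instead of being serialised by hand.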
- Majestic-12 is working towards creation of a World Wide Web search engine based on concepts of distributing workload in a similar fashion to that achieved by successful projects such as SETI@home and distributed.net.
MJ12bot, the principal distributed component of the Majestic-12 search engine project, is the subject of continuing investment by Majestic-12 Ltd. The results of this crawl are fed into a specialised search engine with daily updates.
- https://github.com/toshi-search/Toshi - A Full Text Search Engine in Rust Based on Tantivy
- https://en.wikipedia.org/wiki/RankBrain - a machine learning-based search engine algorithm, the use of which was confirmed by Google on 26 October 2015. It helps Google to process search results and provide more relevant search results for users. In a 2015 interview, Google commented that RankBrain was the third most important factor in the ranking algorithm, along with links and content. As of 2015, "RankBrain was used for less than 15% of queries." The results show that RankBrain produces results that are well within 10% of the Google search engine engineer team.
- http://code.google.com/p/googlecl/ - GoogleCL, command-line tools for Google services
- https://github.com/jarun/ddgr - DuckDuckGo from the terminal
- https://en.wikipedia.org/wiki/Robots_exclusion_standard - also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize web sites. Not all robots cooperate with the standard; email harvesters, spambots and malware robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with Sitemaps, a robot inclusion standard for websites.
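How a well-behaved crawler interprets these rules can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content and the `ExampleBot` agent name below are made up for illustration; MJ12bot is used only as an example of a user agent being shut out entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler completely,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked_bot = parser.can_fetch("MJ12bot", "https://example.com/index.html")
other_bot = parser.can_fetch("ExampleBot", "https://example.com/index.html")
private_page = parser.can_fetch("ExampleBot", "https://example.com/private/a.html")
```

The parser only reports what the file asks for. As noted above, robots that ignore the standard have to be blocked by other means.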
Bad robots (blocked with csf):
- bingbot - 18.104.22.168/24, 22.214.171.124/24
- majestic-12 - 126.96.36.199
- Google Tag Manager lets you add and update your website tags, easily and for free, whenever you want, without bugging the IT folks. It gives marketers greater flexibility, and lets webmasters relax and focus on other important tasks.
- Google Keyword Planner is like a workshop for building new Search Network campaigns or expanding existing ones. You can search for keyword and ad group ideas, get historical statistics, see how a list of keywords might perform, and even create a new keyword list by multiplying several lists of keywords together. A free AdWords tool, Keyword Planner can also help you choose competitive bids and budgets to use with your campaigns.
- Sitemaps - an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
- The Importance of Sitemaps - Oct 12, 2008
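The XML structure described in the Sitemaps entry above can be generated with Python's standard library. This is a minimal sketch: the `build_sitemap` helper and the example page data are hypothetical, while the element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for a list of dicts describing pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Only loc is required; the other tags are optional metadata.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"loc": "https://example.com/", "lastmod": "2008-10-12",
     "changefreq": "weekly", "priority": 0.8},
    {"loc": "https://example.com/about"},
])
```

The resulting string would normally be saved as `sitemap.xml` at the site root and referenced from robots.txt or submitted to the search engines directly.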
- YaCy - a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized: all users of the search engine network are equal, the network does not store user search requests, and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world's users.
- https://nerdydata.com/search - search engine for website source code