Search

Things and Stuff Wiki - An organically evolving personal wiki knowledge base. An on-the-fly taxonomy containing a patchwork trail of topic outlines, descriptions, notes, stubs and breadcrumbs, with links to sites, systems, software, manuals, organisations, people, articles, guides, slides, papers, books, comments, videos, screencasts, webcasts, scratchpads and more. Content is orientated towards mostly free/libre/open, mostly Linux. Quality and age varies drastically. Sometimes old things are first, sometimes last. Use the Table of Contents menu to navigate long pages. Zoom in if text is too small. Dead link? Wayback Machine. I probably need to fix the theme CSS after an update. See also libreav.org. Chat to msg me (not checking tho atm).

General

Tools

Sherlock - a powerful command line tool provided by the Sherlock Project that can be used to find usernames across many social networks. It requires Python 3.6 or higher and works on macOS, Linux and Windows.
- https://github.com/sherlock-project/sherlock

Surfraw

Surfraw - provides a fast unix command line interface to a variety of popular WWW search engines and other artifacts of power. It reclaims google, altavista, babelfish, dejanews, freshmeat, research index, slashdot and many others from the false-prophet, pox-infested heathen lands of html-forms, placing these wonders where they belong, deep in unix heartland, as god loving extensions to the shell.

Searx

Searx - a free internet metasearch engine which aggregates results from more than 70 search services. Users are neither tracked nor profiled. Additionally, searx can be used over Tor for online anonymity.

Local files

to fill out

Recoll

Recoll - a desktop full-text search tool. Recoll finds documents based on their contents as well as their file names. Versions are available for Linux, MS Windows and Mac OS X. A web front-end with preview and download features can replace or supplement the GUI for remote use. It can search most document formats. You may need external applications for text extraction. It can reach any storage place: files, archive members, email attachments, transparently handling decompression. One click will open the document inside a native editor or display an even quicker text preview. The software is free, open source, and licensed under the GPL. Detailed features and application requirements for supported document types.

https://github.com/andersju/zzzfoo - This script lets you combine the excellent full-text search tool Recoll with Rofi (popular dmenu replacement, among other things) to quickly search all your indexed files. It simply performs a Recoll search using Recoll's Python module, pipes the output to rofi -dmenu, and (optionally) does something with the selected match. If you only need file paths in your results, forget this script and just grep/sed the result of recoll -t -e -b <foobar> and pipe that into rofi or dmenu; this script is basically a one-liner that got out of hand. However, if you want titles and MIME types and highlighted extracts and colors, as well as various options, keep reading.
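The grep/sed shortcut described above can be sketched in Python. This is a minimal sketch and not zzzfoo's own code: it assumes that `recoll -t -e -b` prints one URL-encoded `file://` URL per line and that `rofi` is on the PATH. The function names `parse_recoll_urls` and `pick_with_rofi` are hypothetical.

```python
import subprocess
from urllib.parse import unquote, urlparse


def parse_recoll_urls(output: str) -> list[str]:
    """Turn URL-per-line output into plain, decoded file paths."""
    paths = []
    for line in output.splitlines():
        line = line.strip()
        # Assumption: results are file:// URLs; anything else is skipped.
        if line.startswith("file://"):
            paths.append(unquote(urlparse(line).path))
    return paths


def pick_with_rofi(query: str) -> str:
    """Search with recoll, then let the user choose one path in rofi."""
    found = subprocess.run(
        ["recoll", "-t", "-e", "-b", query],
        capture_output=True, text=True, check=True,
    ).stdout
    menu = "\n".join(parse_recoll_urls(found))
    chosen = subprocess.run(
        ["rofi", "-dmenu"], input=menu, capture_output=True, text=True,
    ).stdout
    return chosen.strip()
```

Only `parse_recoll_urls` is free of external dependencies; `pick_with_rofi` needs both programs installed and an indexed Recoll database.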

Beagle

Beagle - a search tool that ransacks your personal information space to find whatever you're looking for. Beagle can search in many different domains.

Baloo

https://github.com/KDE/baloo - a framework for searching and managing metadata.

docs/user/searching.md · master · Frameworks / Baloo · GitLab

Systems

Apache Lucene

Apache Lucene - the Apache Lucene project develops open-source search software. The project releases a core search library, named Lucene™ core, as well as PyLucene, a Python binding for Lucene.
- https://github.com/apache/lucene

https://en.wikipedia.org/wiki/Apache_Lucene

Apache Solr

Apache Solr - the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
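As a rough illustration of the REST-like HTTP/JSON API mentioned above, the sketch below builds a query URL for Solr's `/select` handler. The base URL, the core name `docs` and the function name are assumptions for illustration; a real install may sit behind a different host, port or servlet path.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build a /select URL asking Solr for JSON results."""
    # q = query string, rows = page size, wt = response writer (json).
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local instance and core name.
    print(build_solr_select_url("http://localhost:8983/solr", "docs", "title:search", rows=5))
```

The resulting URL can be fetched with any HTTP client, which is what makes the API usable from virtually any language.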

http://wiki.apache.org/solr
- http://en.wikipedia.org/wiki/Apache_Solr
- http://lucene.apache.org/solr

http://www.apache.org/dyn/closer.cgi/lucene/solr

update; jetty much easier!

http://wiki.apache.org/solr/SolrInstall
- http://wiki.apache.org/solr/SolrTomcat
- http://forge.bearstech.com/trac/wiki/DebianSolrInstall

/var/lib/tomcat6/webapps/solr
/etc/tomcat6/Catalina/localhost/solr.xml

http://www.mkyong.com/tomcat/tomcat-default-administrator-password/

Blacklight - an open source Solr user interface discovery platform. You can use Blacklight to enable searching and browsing of your collections. Blacklight uses the Apache Solr search engine to search full text and/or metadata. Blacklight has a highly configurable Ruby on Rails front-end. Blacklight was originally developed at the University of Virginia Library and is made public under an Apache 2.0 license.
- https://github.com/projectblacklight/blacklight

CLucene

CLucene - a high-performance, scalable, cross platform, full-featured, open-source indexing and searching API. Specifically, CLucene is the guts of a search engine, the hard stuff. You write the easy stuff: the UI and the process of selecting and parsing your data files to pump them into the search engine yourself, and any specialized queries to pull it back for display or further processing. CLucene is a port of the very popular Java Lucene text search engine API. CLucene aims to be a good alternative to Java Lucene when performance really matters or if you want to stick to good old C++. CLucene is faster than Lucene as it is written in C++, meaning it is compiled into machine code, has no background GC operations, and requires no extra setup procedures.
- https://github.com/synhershko/clucene

Elasticsearch

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.
- https://en.wikipedia.org/wiki/Elasticsearch
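To give a feel for the query DSL, here is a minimal sketch that builds the JSON body of a simple `match` query. The field name and the helper `build_match_query` are assumptions for illustration; sending the body to a cluster's `_search` endpoint is left out.

```python
import json


def build_match_query(field: str, text: str, size: int = 10) -> str:
    """Serialise a basic full-text match query as a JSON request body."""
    body = {
        "query": {"match": {field: text}},  # analysed full-text match on one field
        "size": size,                        # maximum number of hits to return
    }
    return json.dumps(body)


if __name__ == "__main__":
    print(build_match_query("title", "full-text search", size=5))
```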

https://github.com/sonian/elasticsearch-jetty - brings full power of Jetty and adds several new features to elasticsearch. With this plugin elasticsearch can now handle SSL connections, support basic authentication, and log all or some incoming requests in plain text or json formats.

Sphinx

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems. Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.

Techu exposes a RESTful API for realtime indexing and searching with the Sphinx full-text search engine. We leverage Redis, Nginx and the Python Django framework to make searching easy to handle & flexible.

Majestic-12

Majestic-12 is working towards creation of a World Wide Web search engine based on concepts of distributing workload in a fashion similar to that achieved by successful projects such as SETI@home and distributed.net.

MJ12bot, the principal distributed component of the Majestic-12 search engine project is the subject of continuing investment by Majestic-12 Ltd. The results of this crawl are fed into a specialised search engine with daily updates. A full explanation follows.

Toshi

https://github.com/toshi-search/Toshi - A Full Text Search Engine in Rust Based on Tantivy

GNES

GNES is Generic Neural Elastic Search - https://github.com/gnes-ai/gnes [2]

Typesense

Typesense - an open source, typo tolerant search engine that delivers fast and relevant results out-of-the-box.
- https://github.com/typesense/typesense

Stork

Stork - a library for creating beautiful, fast, and accurate full-text search interfaces on the web. It comes in two parts. First, it's a command-line tool that indexes content and creates a search index file that you can upload to a web server. Second, it's a Javascript library that uses that search index file to build an interactive search interface that displays optimal search results immediately to your user, as they type. Stork is built with Rust, and the Javascript library uses WebAssembly behind the scenes. It's easy to get started and is even easier to customize so it fits your needs. It's perfect for Jamstack sites and personal blogs, but can be used wherever you need to bring search to your users.
- https://github.com/jameslittle230/stork

Whoosh

Whoosh
- https://github.com/mchaput/whoosh

Scout

Scout - aims to be a lightweight, RESTful search server in the spirit of ElasticSearch, powered by the SQLite full-text search extension. In addition to search, Scout can be used as a document database, supporting complex filtering operations. Arbitrary files can be attached to documents and downloaded through the REST API.
- https://github.com/coleifer/scout
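The SQLite full-text search extension that Scout builds on can be tried directly from Python's standard library. This is a minimal sketch of the underlying idea, not Scout's API: it assumes the local SQLite build includes FTS5, and the table name `docs` and both function names are made up for illustration.

```python
import sqlite3


def build_index(documents: dict[str, str]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load title/body pairs into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany(
        "INSERT INTO docs (title, body) VALUES (?, ?)", documents.items()
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return titles of matching documents, best match first."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [title for (title,) in rows]
```

A server such as Scout adds a REST layer, filtering and attachments on top of this kind of table.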

Typesense

Typesense - Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch. Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
- https://github.com/typesense/typesense

Xapian

Xapian - an Open Source Search Engine Library, released under the GPL v2+. It's written in C++, with bindings to allow use from Perl, Python 2, Python 3, PHP, Java, Tcl, C#, Ruby, Lua, Erlang, Node.js and R (so far!). Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It has built-in support for several families of weighting models and also supports a rich set of boolean query operators.
- https://trac.xapian.org/browser/git/xapian-core?order=name

Bleve

Bleve - A modern text indexing library for Go
- https://github.com/blevesearch/bleve

Sonic

https://github.com/valeriansaliou/sonic - Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM. Rust.

https://github.com/moshe/asonic - async python client for the sonic search backend

Toshi

https://github.com/toshi-search/Toshi - A full-text search engine in Rust

Tantivy

https://github.com/quickwit-oss/tantivy - a full-text search engine library inspired by Apache Lucene and written in Rust

Elasticlunr.js

Elasticlunr.js - a lightweight full-text search engine in Javascript for browser search and offline search. Elasticlunr.js is based on Lunr.js, but is more flexible. Like Lunr.js, it is a bit like Solr, but much smaller and not as bright; on top of that it provides flexible configuration, field search and query-time boosting.
- https://github.com/weixsong/elasticlunr.js

Fuse.js

Fuse.js - Powerful, lightweight fuzzy-search library, with zero dependencies.
- https://github.com/krisk/Fuse
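For a quick feel of what fuzzy (typo-tolerant) matching does, here is a small sketch using Python's standard `difflib`. It is a stand-in technique based on sequence similarity ratios, not the algorithm Fuse.js uses, and `fuzzy_find` is a hypothetical helper name.

```python
from difflib import get_close_matches


def fuzzy_find(term: str, candidates: list[str], limit: int = 3) -> list[str]:
    """Return candidates that are similar to term, closest first."""
    # Compare case-insensitively but return the original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(term.lower(), list(lowered), n=limit, cutoff=0.6)
    return [lowered[h] for h in hits]


if __name__ == "__main__":
    print(fuzzy_find("serach", ["Search", "Solr", "Sphinx"]))
```

The `cutoff` of 0.6 is `difflib`'s default similarity threshold; raising it makes matching stricter.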

Lunr.js

Lunr.js - A bit like Solr, but much smaller and not as bright.
- https://github.com/olivernn/lunr.js

Lunr.py

Lunr.py - A Python implementation of Lunr.js by Oliver Nightingale.
- https://github.com/yeraydiazdiaz/lunr.py

Meilisearch

Meilisearch - A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow
- https://github.com/meilisearch/MeiliSearch

Services

Google

Search operators

http://www.google.com/insidesearch/howsearchworks/thestory/

https://en.wikipedia.org/wiki/RankBrain - a machine learning-based search engine algorithm, the use of which was confirmed by Google on 26 October 2015. It helps Google to process search results and provide more relevant search results for users. In a 2015 interview, Google commented that RankBrain was the third most important factor in the ranking algorithm along with links and content. As of 2015, "RankBrain was used for less than 15% of queries." The results show that RankBrain produces results that are well within 10% of the Google search engine engineer team.

http://www.googleguide.com/

http://code.google.com/p/googlecl/ - commandline

Google Custom Search Engine

https://www.google.com/webmasters/tools

http://moz.com/google-algorithm-change

http://www.seomofo.com/snippet-optimizer.html

https://medium.com/@dannypage/stop-using-google-trends-a5014dd32588#.by330oyng [3]

DuckDuckGo

https://duckduckgo.com/

https://github.com/jarun/ddgr - DuckDuckGo from the terminal

Wolfram Alpha

Wolfram|Alpha - Compute expert-level answers using Wolfram’s breakthrough algorithms, knowledgebase and AI technology

Wolfram Prompt Repository

Prompts for Work & Play: Launching the Wolfram Prompt Repository—Stephen Wolfram Writings

Which Is Closer: Local Beer or Local Whiskey?—Wolfram|Alpha Blog [4]

Other

http://symbolhound.com/

https://www.qwant.com/

Swisscows - the alternative, data secure search engine.

https://www.yandex.com/

https://yahoo.com/

Startpage.com - The world's most private search engine

Searxes

robots.txt

https://en.wikipedia.org/wiki/Robots_exclusion_standard - also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize web sites. Not all robots cooperate with the standard; email harvesters, spambots and malware robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with Sitemaps, a robot inclusion standard for websites.
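How a cooperating robot applies these rules can be sketched with Python's standard `urllib.robotparser`. The rules, bot names and URLs below are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Example rules: everyone stays out of /private/, BadBot is excluded entirely.
RULES = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = make_parser(RULES)
    for agent, url in [
        ("GoodBot", "https://example.com/index.html"),
        ("GoodBot", "https://example.com/private/notes"),
        ("BadBot", "https://example.com/index.html"),
    ]:
        print(agent, url, parser.can_fetch(agent, url))
```

As the description notes, this only models well-behaved crawlers; nothing in the standard forces a robot to run such a check.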

http://www.robotstxt.org/

Bad robots (csf blocked);

bingbot - 207.46.13.0/24, 157.55.39.0/24
majestic-12 - 136.243.103.165
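A quick way to check whether a client address falls inside one of the ranges listed above is Python's standard `ipaddress` module. This is only a sketch of the lookup, not how csf itself is configured; `is_blocked` is a hypothetical helper.

```python
import ipaddress

# Ranges taken from the list above; the single address is treated as a /32.
BLOCKED = [
    ipaddress.ip_network(net)
    for net in ("207.46.13.0/24", "157.55.39.0/24", "136.243.103.165/32")
]


def is_blocked(addr: str) -> bool:
    """Return True if addr belongs to any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)


if __name__ == "__main__":
    for candidate in ("207.46.13.17", "136.243.103.165", "8.8.8.8"):
        print(candidate, is_blocked(candidate))
```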

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker - Nginx Block Bad Bots, Spam Referrer Blocker, Vulnerability Scanners, User-Agents, Malware, Adware, Ransomware, Malicious Sites, with anti-DDOS, Wordpress Theme Detector Blocking and Fail2Ban Jail for Repeat Offenders

https://www.ttla.com/robots.txt

Alerts

Services

https://en.mention.net/

http://www.talkwalker.com/alerts

SEM/SEO

http://en.wikipedia.org/wiki/Search_engine_marketing

http://en.wikipedia.org/wiki/Search_engine_optimization

SEO For Web Engineers: 38 Hard-Earned Lessons and Tips - [5]

Mashable: HOW TO: Optimize Your Site for Search Engine Marketing

http://blog.mythly.com/increased-search-traffic-page-load-speed/

http://moz.com/

http://aberrant.me/no-google-authorship-didnt-decrease-your-traffic-by-90/

http://www.slideshare.net/aliceaudrey/seo-sem-smo-project

Mashable: 12 Questions to Ask Before Hiring an SEO Expert

The SEO Rapper - Page Rank

Google

Google Tag Manager lets you add and update your website tags, easily and for free, whenever you want, without bugging the IT folks. It gives marketers greater flexibility, and lets webmasters relax and focus on other important tasks.
- Introduction to Google Tag Manager

Google Keyword Planner is like a workshop for building new Search Network campaigns or expanding existing ones. You can search for keyword and ad group ideas, get historical statistics, see how a list of keywords might perform, and even create a new keyword list by multiplying several lists of keywords together. A free AdWords tool, Keyword Planner can also help you choose competitive bids and budgets to use with your campaigns.

News

http://inbound.org/

Articles etc.

What Every Programmer Should Know About SEO

Are there any clear indicators that my sitemap file is beneficial?

https://segment.io/academy/the-quickest-wins-in-seo

https://ma.ttias.be/technical-guide-seo/ [6]

http://www.searchenabler.com/blog/meta-description-tags-complete-guide/

Search Engine Optimization (SEO) Kit

Why Your Links Should Never Say “Click Here”

Creating a semantic breadcrumb using HTML5 microdata

Sitemaps

Sitemaps - an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
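The simplest form described above, an XML file listing URLs with optional metadata, can be generated with Python's standard library. This is a minimal sketch: the entry dictionaries and the function name `build_sitemap` are assumptions for illustration, while the element names and namespace follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[dict]) -> str:
    """Build a sitemap XML string from dicts with loc and optional metadata."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # loc is the page URL; the other three are the optional hints
        # (last update, change frequency, relative importance).
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "https://example.com/", "lastmod": "2024-01-01",
         "changefreq": "weekly", "priority": 0.8},
    ]))
```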

The Importance of Sitemaps - Oct 12, 2008

Distributed

YaCy

YaCy - a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world's users.
- https://github.com/yacy/yacy_search_server

Other

https://news.ycombinator.com/item?id=6821792

http://www.google.com/blogsearch

http://www.symbolhound.com/

http://globalfilesearch.com/

http://filecrop.com/

http://michaelyingling.com/random/calvin_and_hobbes/

http://sw.deri.org/2009/01/visinav/

http://commoncrawl.org/

https://millionshort.com/ [7]

http://www.pornmd.com/sex-search

http://www.sympygamma.com/ - OSS W|A [8]

https://news.ycombinator.com/item?id=8902728

http://surfraw.alioth.debian.org/ [9]
- https://wiki.archlinux.org/index.php/Surfraw

https://nerdydata.com/search - web site source code

OSINT

https://en.wikipedia.org/wiki/Open-source_intelligence - (OSINT) is data collected from publicly available sources to be used in an intelligence context. In the intelligence community, the term "open" refers to overt, publicly available sources (as opposed to covert or clandestine sources). It is not related to open-source software or collective intelligence. OSINT under one name or another has been around for hundreds of years. With the advent of instant communications and rapid information transfer, a great deal of actionable and predictive intelligence can now be obtained from public, unclassified sources.

SpiderFoot

SpiderFoot - an open source intelligence (OSINT) automation tool. It integrates with just about every data source available and utilises a range of methods for data analysis, making that data easy to navigate. SpiderFoot has an embedded web-server for providing a clean and intuitive web-based interface but can also be used completely via the command-line. It's written in Python 3 and GPL-licensed. [10]
- https://github.com/smicallef/spiderfoot

Remembrance agent

Remembrance Agent: A continuously running automated information retrieval system - a program which augments human memory by displaying a list of documents which might be relevant to the user's current context. Unlike most information retrieval systems, the RA runs continuously without user intervention. Its unobtrusive interface allows a user to pursue or ignore the RA's suggestions as desired.
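The core idea, ranking stored documents by relevance to whatever the user is currently working on, can be sketched with a plain bag-of-words cosine similarity. This is an illustrative stand-in and not the Remembrance Agent's actual retrieval method; all names below are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Count lowercase alphanumeric words in text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest(context: str, documents: dict[str, str], limit: int = 3) -> list[str]:
    """Return names of documents sharing words with context, best first."""
    ctx = tokenize(context)
    scored = [(cosine(ctx, tokenize(body)), name) for name, body in documents.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:limit]
```

A continuously running agent would call something like `suggest` each time the context changes, for example on the text around the cursor in an editor.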

Publications: Cambridge Technical Reports: EPC-1994-103 - At RXRC we have been trying to understand how anticipated developments in mobile computing will impact our customers in the 21st century. One opportunity we can see is to improve computer-based support for human memory -- ironically a problem in office systems research that has almost been forgotten. Considering how often computers are presented as devices capable of memorising vast quantities of information, and performing difficult-to-memorise sequences of operations on our behalf, we might be surprised at how often they appear to have increased the load on our own memory. The Forget-me-not project is an attempt to explore new ways in which mobile and ubiquitous technologies might help alleviate the increasing load. Forget-me-not is a memory aid designed to help with everyday memory problems: finding a lost document, remembering somebody's name; recalling how to operate a piece of machinery. It exploits some well understood features of human episodic memory to provide alternative ways of retrieving information that was once known but has now been forgotten. We start by introducing a model of computing in the 21st century which we call the Intimate Computing model and talk about some of the opportunities and problems we anticipate it will provoke. After cursory introduction to the basics of human episodic memory, we describe the architecture and user interface of Forget-me-not. We end with a few preliminary conclusions drawn from our early experiences with the prototype.

https://github.com/zzkt/remembrance-agent - As of 2021, this code hasn’t been updated to work with recent versions of emacs or received much in the way of maintenance. The project links are inactive and it appears abandoned by the author. Mainly kept for historical purposes. If you are looking for a knowledge management system for emacs that is currently maintained, have a look at org-mode or org-roam or one of the zettelkasten modes…
