Regex

From Things and Stuff Wiki


General

  • https://en.wikipedia.org/wiki/Regular_expression - a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep, a filter.

In modern usage, "regular expressions" are often distinguished from the derived, but fundamentally distinct concepts of regex or regexp, which no longer describe a regular language.

Regexps are so useful in computing that the various systems to specify regexps have evolved to provide both a basic and extended standard for the grammar and syntax; modern regexps heavily augment the standard. Regexp processors are found in several search engines, search and replace dialogs of several word processors and text editors, and in the command lines of text processing utilities, such as sed and AWK.

Many programming languages provide regexp capabilities, some built-in, for example Perl, JavaScript, Ruby, AWK, and Tcl, and others via a standard library, for example .NET languages, Java, Python, POSIX C and C++ (since C++11). Most other languages offer regexps via a library.
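
The basic and extended grammars mentioned above differ mainly in which metacharacters need a backslash. A minimal sketch, assuming a POSIX grep; the sample input and variable names are made up for illustration:

```shell
# Sample input: three lines, only the second has two consecutive "a"s before a "b".
sample=$(printf 'ab\naab\nb\n')

# POSIX basic syntax (BRE): interval braces must be backslash-escaped.
bre_match=$(printf '%s\n' "$sample" | grep 'a\{2\}b')

# POSIX extended syntax (ERE, grep -E): the same interval is written without backslashes.
ere_match=$(printf '%s\n' "$sample" | grep -E 'a{2}b')

printf 'BRE: %s\nERE: %s\n' "$bre_match" "$ere_match"
```

Both commands select the same line; only the spelling of the pattern changes.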


POSIX

PCRE

See also Languages#Perl

  • PCRE - Perl Compatible Regular Expressions - The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE library is free, even for building proprietary software.




  • https://github.com/PCRE2Project/pcre2 - a set of C functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE2 has its own native API, as well as a set of wrapper functions that correspond to the POSIX regular expression API. The PCRE2 library is free, even for building proprietary software. It comes in three forms, for processing 8-bit, 16-bit, or 32-bit code units, in either literal or UTF encoding. PCRE2 was first released in 2015 to replace the API in the original PCRE library, which is now obsolete and no longer maintained. As well as a more flexible API, the code of PCRE2 has been much improved since the fork.

Guides






Web tools

  • regex101.com - Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript
  • RegExr - an online tool to learn, build, & test Regular Expressions (RegEx / RegExp). Results update in real-time as you type. Roll over a match or expression for details. Save & share expressions with others. Explore the Library for help & examples. Undo & Redo with Ctrl-Z / Y. Search for & rate Community patterns.


  • Debuggex - Online visual regex tester. JavaScript, Python, and PCRE.


  • Rubular - a Ruby regular expression editor
  • PCREck - a multi-dialect regular expression editor
  • txt2re - regular expression generator (perl php python java javascript coldfusion c c++ ruby vb vbscript j# c# c++.net vb.net)

Search


grep

  • https://en.wikipedia.org/wiki/grep - a command-line utility for searching plain-text data sets for lines matching a regular expression. Its name comes from the ed command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. Grep was originally developed for the Unix operating system, but is available today for all Unix-like systems.
grep "apple" *.txt
grep '^a.ple' oldbashimplementations.txt
  # lines that begin with the letter a, followed by any one character, followed by the letter sequence ple (quoted so the shell passes the pattern through unchanged)
grep "stuff" sqldump.sql | fold -w 200 | grep -C 1 "stuff"

The first grep gets the (mile-wide) line that has the match, then fold will split the mile-wide line into 200 char long lines, and "grep -C 1" will show only the one 200 char wide line where the match is + 1 line of context before and after. [9]
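
The effect of the fold step can be tried without a real SQL dump. The following is a rough sketch on generated data; the 1000-character line and the variable names are assumptions for illustration only:

```shell
# Hypothetical input: a single 1000-character line with "stuff" buried in the middle.
long_line=$(awk 'BEGIN {
    for (i = 0; i < 500; i++) printf "x"
    printf "stuff"
    for (i = 0; i < 495; i++) printf "y"
    print ""
}')

# Same pipeline as above, reading from the variable instead of sqldump.sql.
context=$(printf '%s\n' "$long_line" | grep "stuff" | fold -w 200 | grep -C 1 "stuff")

# Count the chunks that survive: the matching one plus one on either side.
printf '%s\n' "$context" | wc -l
```

One caveat of this approach: fold cuts at fixed column positions, so a match that happens to straddle a 200-character boundary is split in two and the second grep will not find it.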


sgrep

sift

ack

ag

ag --hidden --ignore .git --ignore .winscp -l -g ""
  # lists all files

ripgrep

rg 'foo' --files-with-matches | xargs sed -i 's/foo/bar/g'
  # GNU sed
rg 'foo' --files-with-matches | xargs sed -i '' 's/foo/bar/g'
  # BSD sed, including macOS: -i requires a backup-suffix argument, here the empty string [14]
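
A script that has to run on both kinds of system can probe which sed it has before choosing the -i form. This is only a sketch: it uses grep -rl in place of rg so that it runs without ripgrep installed, relies on the common observation that GNU sed accepts --version while BSD sed rejects it, and works on made-up files in a temporary directory:

```shell
# Hypothetical test data in a scratch directory.
workdir=$(mktemp -d)
printf 'foo one\n' > "$workdir/a.txt"
printf 'nothing here\n' > "$workdir/b.txt"

if sed --version >/dev/null 2>&1; then
    # GNU sed: -i takes no separate argument.
    grep -rl 'foo' "$workdir" | xargs sed -i 's/foo/bar/g'
else
    # BSD sed: -i needs a backup suffix; the empty string means "no backup".
    grep -rl 'foo' "$workdir" | xargs sed -i '' 's/foo/bar/g'
fi

result=$(cat "$workdir/a.txt")
untouched=$(cat "$workdir/b.txt")
printf '%s\n%s\n' "$result" "$untouched"
rm -rf "$workdir"
```

File names containing spaces or newlines would need NUL-separated output and xargs -0, which this sketch leaves out.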

strings

  • https://en.wikipedia.org/wiki/strings_(Unix) - a program in Unix-like operating systems that finds and prints text strings embedded in binary files such as executables. It can be used on object files and core dumps. Strings are recognized by looking for sequences of at least 4 (by default) printable characters terminating in a NUL character (that is, null-terminated strings). Some implementations provide options for determining what is recognized as a printable character, which is useful for finding non-ASCII and wide character text. Common usage includes piping its output to grep and fold or redirecting the output to a file.
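
The four-character default is easy to see on a small hand-made file. A minimal sketch, assuming strings is installed (it usually ships with binutils, so the example is guarded); the file contents are invented for illustration:

```shell
# Hypothetical binary file: short and long printable runs separated by NUL and control bytes.
binfile=$(mktemp)
printf 'ab\000hello world\000\001\002xyz\000' > "$binfile"

found=""
if command -v strings >/dev/null 2>&1; then
    # -a scans the whole file; runs shorter than 4 printable characters ("ab", "xyz") are skipped.
    found=$(strings -a "$binfile")
    printf '%s\n' "$found"
fi

rm -f "$binfile"
```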


Google Code Search


qgrep


drep

  • https://github.com/maxpert/drep - grep with dynamic reloadable filter expressions. This allows filtering stream of logs/lines, while changing filters on the fly.

CUDA grep

  • CUDA grep - We successfully created a parallel regular expression matcher using CUDA for GPUs. Our implementation is anywhere from 2x-10x faster than grep depending on the workload and about 68x faster than the perl regex engine. We think that this makes it a viable candidate for use in the real world. [15]

Search and replace


  • regexxer is a nifty GUI search/replace tool featuring Perl-style regular expressions


  • Hyperscan is a high-performance multiple regex matching library. It follows the regular expression syntax of the commonly-used libpcre library, yet functions as a standalone library with its own API written in C. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions, as well as matching of regular expressions across streams of data. [16]

sed

echo "test string oldWord yadayada" | sed 's/oldWord/newWord/g'
sed -i 's/search/replace/' filename
sed -i 's#test#replace#' filename
  # in-place editing of a file, alternative separators
find . -name "*.html" -exec sed -i "s/oldWord/newWord/g" '{}' \;
  # replace text in multiple files [18]
echo '<a href="index.html"><img src="logo.svg" id="site-logo"></a>
          <h1>Site Title</h1>' | sed 'N; s@</a>\
          <h1>Site Title</h1>@\
          <h1>Site Title</h1></a>@g'
  # multiline replacement: N appends the next line to the pattern space, so the pattern can span the line break
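
Alternative separators pay off mostly when the pattern itself contains slashes. A small sketch on a made-up path, comparing the two spellings:

```shell
# Hypothetical path to rewrite.
path='/usr/local/bin/tool'

# With "/" as the delimiter every slash in the pattern must be escaped.
escaped=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With "#" as the delimiter the slashes can be written literally.
readable=$(printf '%s\n' "$path" | sed 's#/usr/local#/opt#')

printf '%s\n%s\n' "$escaped" "$readable"
```

Both commands produce the same result; the second is simply easier to read and to get right.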



awk

  • https://en.wikipedia.org/wiki/AWK - a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems. The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data – either run directly on files or used as part of a pipeline – for purposes of extracting or transforming text, such as producing formatted reports. The language extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. While AWK has a limited intended application domain and was especially designed to support one-liner programs, the language is Turing-complete, and even the early Bell Labs users of AWK often wrote well-structured large AWK programs. AWK was created at Bell Labs in the 1970s, and its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. The acronym is pronounced the same as the name of the bird auk (which acts as an emblem of the language such as on The AWK Programming Language book cover – the book is often referred to by the abbreviation TAPL). When written in all lowercase letters, as awk, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language.


  • Gawk - If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. To write a program to do this in a language such as C or Pascal is a time-consuming inconvenience that may take many lines of code. The job is easy with awk, especially the GNU implementation: gawk. The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code.


awk '{print $NF;}'
  # print the last field of each input line
  # "GG      TC    CC" to "G G      T C       C C"
awk ' { gsub("GG","G G");gsub("TC","T C");gsub("CC","C C");print } ' file
  # [21]
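
Regular expressions and associative arrays, both mentioned in the AWK description above, are often combined in short counting scripts. A minimal sketch on an invented request log; the data and the array name count are assumptions for illustration:

```shell
# Hypothetical request log: method followed by path.
log=$(printf 'GET /a\nPOST /b\nGET /c\nHEAD /d\n')

# The regex pattern selects GET and POST lines; count[] is an associative array keyed by method.
summary=$(printf '%s\n' "$log" | awk '
    /^(GET|POST) / { count[$1]++ }
    END { print "GET", count["GET"]; print "POST", count["POST"] }
')

printf '%s\n' "$summary"
```

The END block prints the two keys explicitly because the iteration order of for (key in array) is not specified in awk.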







  • https://github.com/ezrosent/frawk - a small programming language for writing short programs processing textual data. To a first approximation, it is an implementation of the AWK language; many common Awk programs produce equivalent output when passed to frawk. You might be interested in frawk if you want your scripts to handle escaped CSV/TSV like standard Awk fields, or if you want your scripts to execute faster.

sd


bsed

  • https://github.com/andrewbihl/bsed - Simple, English syntax on top of Perl text processing. Designed to replace simple uses of sed/grep/AWK/Perl. Bsed is a stream editor. In contrast to interactive text editors, stream editors process text in one go, applying a command to an entire input stream or open file. [27]

Library



Hyperscan

  • Hyperscan - a high-performance multiple regex matching library available as open source with a C API. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers of regular expressions across streams of data.

Other

  • Regex Crossword is a crossword puzzle game, where the crossword clues are defined using regular expressions. [28] [29]