Files

From Things and Stuff Wiki


General

  • https://en.wikipedia.org/wiki/File_format - a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open. Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for different types of multimedia including any combination of audio and video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, scalable vector graphics, and the source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes.


  • https://en.wikipedia.org/wiki/Digital_container_format - or wrapper format is a metafile format whose specification describes how different elements of data and metadata coexist in a computer file. Among the earliest cross-platform container formats were Distinguished Encoding Rules and the 1985 Interchange File Format. Containers are frequently used in multimedia applications.


  • Best File Formats for Archiving - This guide compares common file formats for the purpose of digital archiving and preservation. It also discusses how to choose a resolution for images, and how to choose a sampling rate and a bit rate for MP3 audio files. Disclaimer: All of the below is provided as my personal opinion only, without guarantee for completeness or correctness. The current version of this text is 2018-01-13.



Types


  • https://en.wikipedia.org/wiki/Filename_extension - or file extension is a suffix to the name of a computer file (for example, .txt, .docx, .md). The extension indicates a characteristic of the file contents or its intended use. A filename extension is typically delimited from the rest of the filename with a full stop (period), but in some systems it is separated with spaces. Other extension formats include dashes and/or underscores on early versions of Linux and some versions of IBM AIX. Some file systems implement filename extensions as a feature of the file system itself and may limit the length and format of the extension, while others treat filename extensions as part of the filename without special distinction.


  • https://en.wikipedia.org/wiki/Media_type - formerly known as MIME type, is a two-part identifier for file formats and format contents transmitted on the Internet. The Internet Assigned Numbers Authority (IANA) is the official authority for the standardization and publication of these classifications. Media types were originally defined in Request for Comments 2045 in November 1996 as part of the MIME (Multipurpose Internet Mail Extensions) specification, for denoting the type of email message content and attachments; hence the original name, MIME type. Media types are also used by other internet protocols such as HTTP and document file formats such as HTML, for a similar purpose.


  • https://wiki.archlinux.org/index.php/Default_Applications - Programs implement default application associations in different ways: while command-line programs traditionally use environment variables, graphical applications tend to use XDG MIME Applications through either the GIO API, the Qt API or by executing /usr/bin/xdg-open, which is part of xdg-utils. Because xdg-open and XDG MIME Applications are quite complex, various alternative resource openers have been developed. These alternatives replace /usr/bin/xdg-open, which only affects applications that actually use that executable. The following table lists some example applications for each method.
  • https://en.wikipedia.org/wiki/File_association - associates a file with an application capable of opening that file. More commonly, a file association associates a class of files (usually determined by their filename extension, such as .txt) with a corresponding application (such as a text editor).


  • https://github.com/mime-types/mime-types-data - provides a registry for information about MIME media type definitions. It can be used with the Ruby mime-types library or other software to determine defined filename extensions for MIME types, or to use filename extensions to look up the likely MIME type definitions.


  • desktop-entry-spec - The desktop entry specification describes desktop entries: files describing information about an application such as the name, icon, and description. These files are used for application launchers and for creating menus of applications that can be launched.


Configuration files:

/usr/share/applications/defaults.list
  # global
~/.local/share/applications/defaults.list
  # per user, overrides global

~/.local/share/applications/mimeapps.list

~/.local/share/applications/mimeinfo.cache


[Default Applications]
mimetype=desktopfile1;desktopfile2;...;desktopfileN


xdg-mime default firefox.desktop x-scheme-handler/https x-scheme-handler/http
  # add two lines to ~/.config/mimeapps.list
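 
The two resulting entries in ~/.config/mimeapps.list look like this:

[Default Applications]
x-scheme-handler/https=firefox.desktop
x-scheme-handler/http=firefox.desktop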



CLI

xdg-mime default Thunar.desktop inode/directory
  # make Thunar the default file-browser

xdg-mime default xpdf.desktop application/pdf
  # make xpdf the default PDF viewer


xdg-settings get default-web-browser
  # returns what app opens http/https



/usr/bin/vendor_perl/mimeopen -d $file.pdf
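  # -d (--ask-default) asks which application to use and makes it the default for that file's MIME type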


  • xdg-settings - get and set various desktop environment settings



  • mimeo - uses MIME-type file associations to determine which application should be used to open a file. It can launch files or print information such as the command that it would use, the detected MIME-type, etc. It is also possible to use regular expressions to associate arguments with applications. The most common example is to open URLs in browsers or associate file extensions with applications irrespective of their MIME-type. Mimeo tries to adhere to the relevant standards on freedesktop.org and should therefore be compatible with other applications that set or read MIME-type associations, e.g. PCManFM.


GUI


 kcmshell5 filetypes
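  # opens KDE's File Associations settings module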

Metadata


Container


IFF / RIFF



Other

  • Kaitai Struct - declarative binary format parsing language, a declarative language used to describe various binary data structures, laid out in files or in memory: i.e. binary file formats, network stream packet formats, etc. The main idea is that a particular format is described in the Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include generated code for a parser that can read the described data structure from a file or stream and give access to it in a nice, easy-to-comprehend API.
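
A minimal sketch of what a .ksy description looks like, for a hypothetical record-based format (all names invented for the example):

meta:
  id: example_records
  endian: le
seq:
  - id: magic
    contents: 'REC'          # fixed signature the file must start with
  - id: num_records
    type: u4                 # little-endian 32-bit record count
  - id: records
    type: record
    repeat: expr
    repeat-expr: num_records
types:
  record:
    seq:
      - id: len
        type: u2
      - id: body
        size: len            # variable-length payload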



Files & directories

  • https://en.wikipedia.org/wiki/Directory_(computing) - a file system cataloging structure which contains references to other computer files, and possibly other directories. On many computers, directories are known as folders, or drawers to provide some relevancy to a workbench or the traditional office file cabinet. Files are organized by storing related files in the same directory. In a hierarchical filesystem (that is, one in which files and directories are organized in a manner that resembles a tree), a directory contained inside another directory is called a subdirectory. The terms parent and child are often used to describe the relationship between a subdirectory and the directory in which it is cataloged, the latter being the parent. The top-most directory in such a filesystem, which does not have a parent of its own, is called the root directory.


  • https://en.wikipedia.org/wiki/Unix_file_types - For normal files in the file system, Unix does not impose or provide any internal file structure. This implies that from the point of view of the operating system, there is only one file type. The structure and interpretation thereof is entirely dependent on how the file is interpreted by software. Unix does however have some special files. These special files can be identified by the ls -l command which displays the type of the file in the first alphabetic letter of the file system permissions field. A normal (regular) file is indicated by a hyphen-minus '-'.


  • https://en.wikipedia.org/wiki/File_descriptor - In the traditional implementation of Unix, file descriptors index into a per-process file descriptor table maintained by the kernel, that in turn indexes into a system-wide table of files opened by all processes, called the file table. This table records the mode with which the file (or other resource) has been opened: for reading, writing, appending, reading and writing, and possibly other modes. It also indexes into a third table called the inode table that describes the actual underlying files. To perform input or output, the process passes the file descriptor to the kernel through a system call, and the kernel will access the file on behalf of the process. The process does not have direct access to the file or inode tables.

On Linux, the set of file descriptors open in a process can be accessed under the path /proc/PID/fd/, where PID is the process identifier. In Unix-like systems, file descriptors can refer to any Unix file type named in a file system. As well as regular files, this includes directories, block and character devices (also called "special files"), Unix domain sockets, and named pipes. File descriptors can also refer to other objects that do not normally exist in the file system, such as anonymous pipes and network sockets.
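
For example, to list the descriptors the current shell has open:

ls -l /proc/$$/fd
  # $$ expands to the shell's PID; 0, 1 and 2 (stdin, stdout, stderr) appear as symlinks to the underlying files or terminals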

The FILE data structure in the C standard I/O library usually includes a low level file descriptor for the object in question on Unix-like systems. The overall data structure provides additional abstraction and is instead known as a file handle.






Directory structure

See LSB, etc.


  • https://en.wikipedia.org/wiki/stat_(system_call) - a Unix system call that returns file attributes about an inode. The semantics of stat() vary between operating systems. stat appeared in Version 1 Unix. It is among the few original Unix system calls to change, with Version 4's addition of group permissions and larger file size. As an example, Unix command ls uses this system call to retrieve information on files that includes:
  • atime: time of last access (ls -lu)
  • mtime: time of last modification (ls -l)
  • ctime: time of last status change (ls -lc)
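
GNU stat can print the same three timestamps directly:

stat -c '%x %y %z %n' filename
  # %x = last access, %y = last modification, %z = last status change, %n = file name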





  • https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard - defines the directory structure and directory contents in Unix and Unix-like operating systems. It is maintained by the Linux Foundation. The latest version is 3.0, released on 3 June 2015. Currently it is only used by Linux distributions.




  • rmshit - keeps $HOME or other directories clean of unwanted tempfiles, configs and other crap you'll never use that is autocreated by badly behaved applications


  • BleachBit quickly frees disk space and tirelessly guards your privacy. Free cache, delete cookies, clear Internet history, shred temporary files, delete logs, and discard junk you didn't know was there. Designed for Linux and Windows systems, it wipes clean a thousand applications including Firefox, Internet Explorer, Adobe Flash, Google Chrome, Opera, Safari, and more. Beyond simply deleting files, BleachBit includes advanced features such as shredding files to prevent recovery, wiping free disk space to hide traces of files deleted by other applications, and vacuuming Firefox to make it faster.
 bleachbit --clean system.cache system.localizations system.trash system.tmp


File permissions

stat -c '%A %a %n' *
  list permissions in symbolic and octal form, with filename

ulimit

  • ulimit - provides control over the resources available to the shell and to processes started by it, on systems that allow such control.
  • limits.conf - configuration file for the pam_limits module, /etc/security/limits.conf

chmod

chmod
  # change file mode bits
  # Before: drwxr-xr-x 6 archie users  4096 Jul  5 17:37 Documents
chmod g= Documents
chmod o= Documents
  # After: drwx------ 6 archie users  4096 Jul  6 17:32 Documents

Users;

  • u - the user who owns the file.
  • g - other users who are in the file's group.
  • o - all other users.
  • a - all users; the same as ugo.

Operation;

  • + - to add the permissions to whatever permissions the users already have for the file.
  • - - to remove the permissions from whatever permissions the users already have for the file.
  • = - to make the permissions the only permissions that the users have for the file.

Permissions;

  • r - the permission the users have to read the file.
  • w - the permission the users have to write to the file.
  • x - the permission the users have to execute the file, or search it if it is a directory.
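
For instance, the same permissions expressed symbolically and in octal:

chmod u=rwx,g=rx,o= script.sh
chmod 750 script.sh
  # both give the owner rwx, the group r-x, and others nothing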

chown

chown -R user:group .
  change all files and directories to a specific user and group [8]

Access Control Lists

The partition must be mounted with the acl option; an "Operation not supported" error from setfacl means it is not.

mount -o remount,acl /

Set in /etc/fstab for persistence across boots.

Opening a directory requires execute permission.

setfacl -m "u:username:permissions"
setfacl -m "u:uid:permissions"
  add permissions for user

setfacl -m "g:groupname:permissions"
setfacl -m "g:gid:permissions"
  add permissions for group

setfacl -m "u:user:rwx" file
  add read, write, execute perms for user for file
setfacl -Rm "u:user:rw" /dir
  add recursive read, write perms for user for dir
setfacl -Rdm "u:user:rw" /dir
  add recursive read, write perms for user for dir and make them default for future changes
getfacl ./
  shows access control list for directory or file
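
Typical getfacl output (user and group names illustrative):

# file: somedir/
# owner: archie
# group: users
user::rwx
user:bob:r-x
group::r-x
mask::r-x
other::---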


Copying

See also Sharing, Backup

cp

cp - copy files and directories
cp [option]… [-T] source dest


cp --parents a/b/c existing_dir
  # copies the file `a/b/c' to `existing_dir/a/b/c', creating any missing intermediate directories. [10]


cp ubuntu.img /dev/sdb && sync


scp

scp -P 2264 foobar.txt your_username@remotehost.edu:/some/remote/directory
scp -rP 2264 folder your_username@remotehost.edu:/some/remote/directory

  • dcp - a distributed file copy program that automatically distributes and dynamically balances work equally across nodes in a large distributed system without centralized state. http://filecopy.org/

rsync

  • rsync(1) - a fast, versatile, remote (and local) file-copying tool.
  • rsync is a software application and network protocol for Unix-like systems with ports to Windows that synchronizes files and directories from one location to another while minimizing data transfer by using delta encoding when appropriate. Quoting the official website: "rsync is a file transfer program for Unix systems. rsync uses the 'rsync algorithm' which provides a very fast method for bringing remote files into sync." An important feature of rsync not found in most similar programs/protocols is that the mirroring takes place with only one transmission in each direction. rsync can copy or display directory contents and copy files, optionally using compression and recursion.


  • Easy Automated Snapshot-Style Backups with Rsync - This document describes a method for generating automatic rotating "snapshot"-style backups on a Unix-based system, with specific examples drawn from the author's GNU/Linux experience. Snapshot backups are a feature of some high-end industrial file servers; they create the illusion of multiple, full backups per day without the space or processing overhead. All of the snapshots are read-only, and are accessible directly by users as special system directories. It is often possible to store several hours, days, and even weeks' worth of snapshots with slightly more than 2x storage. This method, while not as space-efficient as some of the proprietary technologies (which, using special copy-on-write filesystems, can operate on slightly more than 1x storage), makes use of only standard file utilities and the common rsync program, which is installed by default on most Linux distributions. Properly configured, the method can also protect against hard disk failure, root compromises, or even back up a network of heterogeneous desktops automatically.
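
The core of that snapshot technique can be sketched with rsync's --link-dest option (paths illustrative), which hard-links files unchanged since the previous snapshot instead of copying them:

rsync -a --delete --link-dest=../snapshot.1 /home/ /backups/snapshot.0/
  # --link-dest is resolved relative to the destination; unchanged files cost no extra space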


"Unfortunately “--sparse” and “--inplace” cannot be used together. Solution: When copying the file the first time, which means it does not exist on the target server use “rsync --sparse“. This will create a sparse file on the target server and copies only the used data of the sparse file. When the file already exists on the target server and you only want to update it use “rsync --inplace“. This will only transmit the changed blocks and can also append to the existing sparse file."



rsync [OPTION...] SRC... [DEST]
  # copy source file or directory to destination
rsync local-file user@remote-host:remote-file

rsync -e 'ssh -p8023' file remotehost:~/
  # non-standard remote shell command, copy to remote users home directory
rsync -r --partial --progress srcdirectory destdirectory
  # recursive
  # --partial - resume partial files,
  # --progress - show progress during transfer, equiv to --info=flist2,name,progress
  # -P                          same as --partial --progress

rsync -rP srcdirectory/ destdirectory
  # recursive, resume partial files with progress bar, don't copy root source folder


 rsync -ahrH --info=progress2 --no-i-r --partial --partial-dir=.rsync-partial

-a, --archive
  #  archive mode; equals -rlptgoD (no -A,-X,-H)
    # recursive, links, preserve permissions/times/groups/owner/device files.

-r, --recursive             recurse into directories
-l, --links                 copy symlinks as symlinks
-p, --perms                 preserve permissions
-t, --times                 preserve modification times
-g, --group                 preserve group
-o, --owner                 preserve owner (super-user only)
-D                          same as --devices --specials
    --devices               preserve device files (super-user only)
    --specials              preserve special files
 
-h, --human-readable        output numbers in a human-readable format

-v, --verbose               list files transferred
-vv

    --info=progress2       display where rsync is in the overall list of files to be transferred
    --no-i-r               disable incremental recursion so rsync builds the complete file list first
    --partial              resume partially transferred files
    --partial-dir=.tmpdir  keep partially transferred files in a separate directory
    --inplace               This option is useful for transferring large files with block-based changes or appended data, and also on systems that are disk bound, not network bound. It can also help keep a copy-on-write filesystem snapshot from diverging the entire contents of a file that only has minor changes.
    --no-whole-file         force the incremental delta-xfer algorithm (whole-file is the default for local transfers)

-A, --acls                  preserve ACLs (implies -p, save permissions)
-X, --xattrs                preserve extended attributes
-H, --hard-links            preserve hard links

    --exclude-from=FILE     read exclude patterns from FILE
    --sparse                handle sparse files efficiently
-W, --whole-file            copy files whole (w/o delta-xfer algorithm)
-d, --dirs                  transfer directories without recursing
    --numeric-ids           transfer numeric group and user IDs rather than mapping user and group name
-x, --one-file-system       don't cross filesystem boundaries (mount points); "/*" as source bypasses -x, since the shell expands each top-level entry into a separate argument
-xx                         as above, but don't create empty directories for mount points

    --delete                delete extraneous files from dest dirs
    --delete-after          receiver deletes after transfer, not during
    --delete-excluded       also delete excluded files from dest dirs

    --ignore-errors         go ahead even when there are IO errors instead of regarding as fatal
    --stats                 give some file-transfer stats
-z                          compress files during transfer



rsync -aAXv --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found} /* /path/to/backup/directory
  # archive, with permissions/ACL, and attributes, exclude directories

rsync -ahrHP --info=progress2 --no-i-r --partial-dir=.rsync-partial
  # archive, human readable, recursive, don't flatten hard links, progress bar, no incremental recursion, store partial transfers in temp dir



#!/bin/bash
# bash is required for the brace expansion in the --exclude patterns below
if [ $# -lt 1 ] || [ $# -gt 2 ]; then
    echo "Usage: $0 destination [new|rerun]" >&2
    exit 1
fi

if [ "$2" = "new" ]; then
   SPARSEINPLACE="--sparse"    # first run: create sparse files on the target
elif [ "$2" = "rerun" ]; then
   SPARSEINPLACE="--inplace"   # later runs: update existing files in place
fi

RSYNC="ionice -c3 rsync"
RSYNC_ARGS="-aAXv --no-whole-file --stats --ignore-errors --human-readable --delete"
# -a = --recursive, --links, --perms, --times, --group, --owner, --devices, --specials
# -A = --acls, -X = --xattrs, -v = --verbose
# --no-whole-file = delta-xfer algorithm
# --delete = delete extraneous destination files
START=$(date +%s)
SOURCES="/"
TARGET=$1

echo "Executing dry-run to see how many files must be transferred..."
TODO=$(${RSYNC} --dry-run ${RSYNC_ARGS} ${SOURCES} ${TARGET} | grep "^Number of files transferred" | awk '{print $5}')

${RSYNC} ${RSYNC_ARGS} ${SPARSEINPLACE} --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found} \
--exclude={/home/*/.gvfs,/home/*/.thumbnails,/home/*/.cache,/home/*/.cache/mozilla/*,/home/*/.cache/chromium/*} \
--exclude={/home/*/.local/share/Trash/,/home/*/.macromedia,/var/lib/mpd/*,/var/lib/pacman/sync/*,/var/tmp/*} \
 ${SOURCES} "$TARGET" | pv -l -e -p -s "$TODO"

FINISH=$(date +%s)
echo "-- Total backup time: $(( ($FINISH-$START) / 60 ))m, $(( ($FINISH-$START) % 60 ))s"
touch "$TARGET/backup-from-$(date '+%a_%d_%m_%Y_%H%M%S')"
# after refining the excludes, consider adding --delete-excluded

# example crontab entry, running the backup daily at 05:00:
0 5 * * * rsync-backup.sh /path/to/backup/directory rerun


  • https://github.com/kristapsdz/openrsync - a clean-room implementation of rsync with a BSD (ISC) license. It's compatible with a modern rsync (3.1.3 is used for testing, but any supporting protocol 27 will do), but accepts only a subset of rsync's command-line arguments. At this time, openrsync runs only on OpenBSD. If you want to port to your system (e.g. Linux, FreeBSD), read the Portability section first. [13]


Daemon
rsync --daemon
  # run as a daemon
lsyncd
Services

dd

  • dd - Copy a file, converting and formatting according to the options.
    • dd is a common Unix program whose primary purpose is the low-level copying and conversion of raw data.


if stands for input file and of for output file.

dd if=/dev/sda of=/mnt/sdb1/backup.img
  create a backup

dd if=/mnt/sdb1/backup.img of=/dev/sda
  restore a backup

dd if=/dev/sdb of=/dev/sdc
  clone a drive

dd if=/dev/sdb | ssh root@target "(cat >backup.img)"
  backup over network

dd if=/dev/cdrom of=cdimage.iso
  backup a cd
dd if=/dev/sr0 of=myCD.iso bs=2048 conv=noerror,sync
  create an ISO disk image from a CD-ROM.

dd if=/dev/sda2 of=/dev/sdb2 bs=4096 conv=noerror
  Clone one partition to another

dd if=/dev/ad0 of=/dev/ad1 bs=1M conv=noerror
  Clone a hard disk "ad0" to "ad1".

dd if=/dev/zero bs=1024 count=1000000 of=file_1GB
  write a ~1 GB test file in 1024-byte blocks (sequential write benchmark)

dd if=file_1GB of=/dev/null bs=64k
  read the test file back in 64 KiB blocks (sequential read benchmark)


dd if=local-file | ssh <host> "dd of=/path/to/remote-file"

hot-clone

tar pipe

  • The Tar Pipe - It basically means "copy the src directory to dst, preserving permissions and other special stuff." It does this by firing up two tars – one tarring up src, the other untarring to dst, and wiring them together. [15]
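
The pipe itself, in its common form (src and dst illustrative):

(cd src && tar cf - .) | (cd dst && tar xpf -)
  # the first tar writes an archive to stdout, the second extracts it in dst; -p preserves permissions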

File information

ls

ls
  list files in current directory
ls -l
  long list, each file on a new line with information

ls *
  files in directory and immediate subdirectories

ls -a
  # show hidden files
ls  -A
  # show hidden files, exclude . and ..
ls -m1
  -m fills the width with a comma-separated list of entries, but the later -1 overrides it, printing one entry per line
ls --format single-column
  column of names only
ls -l | grep - | awk '{print $9}'
  using awk to show the 9th word (name). strips colour.
ls -l | cut -f9 -s -d" "
  using cut to cut from the 9th word, using space as a delimiter. strips colour.
ls | cat
  # one entry per line; ls switches to single-column output when stdout is not a terminal
  • Greg's Wiki: ParsingLs - Why you shouldn't parse the output of ls(1)




exa


even-better-ls

stat

stat .
  display file or file system status
stat -c "%n %a" * | column -t
  directory files + octal

chattr/lsattr

lsattr .

append only (a), compressed (c), no dump (d), immutable (i), data journaling (j), secure deletion (s), no tail-merging (t), undeletable (u), no atime updates (A), synchronous directory updates (D), synchronous updates (S), and top of directory hierarchy (T).
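
For example, toggling the immutable flag:

chattr +i filename
  # even root cannot modify or delete the file until the flag is cleared
chattr -i filename
  # clear the immutable flag again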

file

file [filename]

binwalk

Directory navigation

pwd

cd

cd change/directory/path

Autojump


z
fasd
v def conf       =>     vim /some/awkward/path/to/type/default.conf
j abc            =>     cd /hell/of/a/awkward/path/to/get/to/abcdef
m movie          =>     mplayer /whatever/whatever/whatever/awesome_movie.mp4
o eng paper      =>     xdg-open /you/dont/remember/where/english_paper.pdf
vim `f rc lo`    =>     vim /etc/rc.local
vim `f rc conf`  =>     vim /etc/rc.conf

alias defaults;

alias a='fasd -a'        # any
alias s='fasd -si'       # show / search / select
alias d='fasd -d'        # directory
alias f='fasd -f'        # file
alias sd='fasd -sid'     # interactive directory selection
alias sf='fasd -sif'     # interactive file selection
alias z='fasd_cd -d'     # cd, same functionality as j in autojump
alias zz='fasd_cd -d -i' # cd with interactive selection
autojump
pazi
zoxide
z.lua
zfm
  • https://github.com/pabloariasal/zfm - a minimal command line bookmark manager for zsh built on top of fzf. It lets you bookmark files and directories in your system and rapidly access them. It's intended to be a less intrusive alternative to z, autojump or fasd that doesn't pollute your prompt command or create bookmarks behind the scenes: you have full control over what gets bookmarked and when, like bookmarks on a web browser.
zz
  • zz - a smart and efficient directory changer [19]
cx

to sort







  • https://github.com/jamesob/desk - Lightweight workspace manager for the shell. Desk makes it easy to flip back and forth between different project contexts in your favorite shell. [24]




Creating files

touch

touch filename
  create a file or update timestamp

mkdir

mkdir directory
mkdir directory -p
  no error if existing, make parent directories as needed

ln

symlink

ln -s {target-filename}
ln -s {target-filename} {symbolic-filename}
  create soft link

fallocate

  • fallocate - used to manipulate the allocated disk space for a file, either to deallocate or preallocate it. For filesystems which support the fallocate system call, preallocation is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeroes. The exit code returned by fallocate is 0 on success and 1 on failure.
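
For example (filename illustrative):

fallocate -l 1G bigfile
  # preallocate a 1 GiB file near-instantly, without writing zeroes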

Test files

yes abcdefghijklmnopqrstuvwxyz0123456789 > largefile
  # about 10 times faster than dd if=/dev/urandom of=largefile, about as fast as dd if=/dev/zero of=filename bs=1M; yes runs until interrupted, so stop it with ctrl-c once the file is large enough [25]

Editing files

See Vim, Emacs

echo "hello" >> greetings.txt

cat temp.txt >> data.txt
  To append the contents of the file temp.txt to file data.txt

date >> dates.txt
  To append the current date/time timestamp to the file dates.txt


Patching files

  • patch(1) - apply a diff file to an original
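
A typical round trip with diff and patch (filenames illustrative):

diff -u original.c modified.c > changes.patch
  # record the differences as a unified diff
patch original.c < changes.patch
  # apply them; patch -R reverses an applied diff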



  • https://github.com/loyso/Scarab - A system to patch your content files. Basically, Scarab is a C++ library to build and apply binary diff packages for directories. Scarab is built to be embeddable into your application or service.

Viewing files

<filename

cat

cat filename
  output file to screen
cat -n filename
  output file to screen w/ line numbers
cat filename1 filename2
  output two files (concatenate)
cat filename1 > filename2
  overwrite filename2 with filename1
cat filename1 >> filename2
  append filename1 to filename2
cat filename{1,2} > filename3
  concatenate filename1 and filename2 into filename3

head

head filename
  top 10 lines of file
head -23 filename
  top 23 lines of file

tail

tail filename
  bottom 10 lines of file
tail -23 filename
  bottom 23 lines of file

Other

sed -n 20,30p filename
  print lines 20..30 of file [27]


bat

look

  • look(1) - displays any lines in file which contain string as a prefix. As look performs a binary search, the lines in file must be sorted. If file is not specified, the file /usr/share/dict/words is used, only alphanumeric characters are compared and the case of alphabetic characters is ignored.
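
For example, against the default word list:

look conf
  # print the lines of /usr/share/dict/words beginning with "conf"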

twf

File pagers

more

more is a filter for paging through text one screenful at a time. This version is especially primitive. Users should realize that less(1) provides more(1) emulation plus extensive enhancements.

less

less is an improvement on more and a funny name.

most

vimpager

other

Moving files

mv

mv position1 ~/position2
  basic move




renameutils

  • renameutils - a set of programs designed to make renaming of files faster and less cumbersome. The file renaming utilities consists of five programs - qmv, qcp, imv, icp and deurlname.

The qmv ("quick move") program allows file names to be edited in a text editor. The names of all files in a directory are written to a text file, which is then edited by the user. The text file is read and parsed, and the changes are applied to the files.

The qcp ("quick cp") program works like qmv, but copies files instead of moving them.

The imv ("interactive move") program, is trivial but useful when you are too lazy to type (or even complete) the name of the file to rename twice. It allows a file name to be edited in the terminal using the GNU Readline library. icp copies files.

The deurlname program removes URL encoded characters (such as %20 representing space) from file names. Some programs such as w3m tend to keep those characters encoded in saved files.
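
For example:

qmv *.jpg
  # opens the matching names in $EDITOR; saving the buffer applies the renames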

Métamorphose

  • Métamorphose - a batch renamer, a program to rename large sets of files and folders quickly and easily. With its extensive feature set, flexibility and powerful interface, Métamorphose is a professional's tool. A must-have for those that need to rename many files and/or folders on a regular basis. In addition to general usage renaming, it is very useful for photo and music collections, webmasters, programmers, legal and clerical, etc.

nmly

Detox

  • Detox - a utility designed to clean up filenames. It replaces difficult to work with characters, such as spaces, with standard equivalents. It will also clean up filenames with UTF-8 or Latin-1 (or CP-1252) characters in them.

Removing files

rm

rm file

rm -rf directory
find * -maxdepth 0 -name 'keepthis' -prune -o -exec rm -rf '{}' ';'
  # remove all but keepthis [31]
fd somefiletype | xargs rm
  # remove all files returned by fd

scrub

Managing files

See File managers

Finding files

See Regex, Search

ls | sort -f | uniq -i -d
  # list filenames that are duplicates when case is ignored [32]
whereis

GNU findutils

GNU Find Utilities are the basic directory searching utilities of the GNU operating system. These programs are typically used in conjunction with other programs to provide modular and powerful directory search and file locating capabilities to other commands.

The tools supplied with this package are:

  • find - search for files in a directory hierarchy
  • locate - list files in databases that match a pattern
  • updatedb - update a file name database
  • xargs - build and execute command lines from standard input
find
find /usr/share -name README
find ~/Journalism -name '*.txt'
find ~/Programming -path '*/src/*.c'
find ~/Journalism -name '*.txt' -exec cat {} \;
  exec command on result path (aliases don't work in exec argument) 
find ~/Images/Screenshots -size +500k -iname '*.jpg'
find ~/Journalism -name '*.txt' -print0 | xargs -0 cat   (faster than above)

find / -group [group]
find / -user [user]
find . -mtime -[n]
  File's data was last modified less than n*24 hours ago

find . -mtime +5 -exec rm {} \;
  remove files older than 5 days
find . -type f -links +1
  list hard links



xargs

xargs reads items from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (default is /bin/echo) one or more times with any initial-arguments followed by items read from standard input. Blank lines on the standard input are ignored.

Because Unix filenames can contain blanks and newlines, this default behaviour is often problematic; filenames containing blanks and/or newlines are incorrectly processed by xargs. In these situations it is better to use the -0 option, which prevents such problems. When using this option you will need to ensure that the program which produces the input for xargs also uses a null character as a separator. If that program is GNU find for example, the -print0 option does this for you.

If any invocation of the command exits with a status of 255, xargs will stop immediately without reading any further input. An error message is issued on stderr when this happens.

echo 'one two three' | xargs mkdir
  # creates the directories one, two and three; ls now shows: one two three

echo 'one two three' | xargs -t rm
  # -t echoes the command before running it: rm one two three
find . -name '*.py' | xargs wc -l
  # Recursively find all Python files and count the number of lines

find . -name '*~' -print0 | xargs -0 rm
  # Recursively find all Emacs backup files and remove them, allowing for filenames with whitespace

find . -name '*.py' | xargs grep 'import'
  # Recursively find all Python files and search them for the word ‘import’
find . -type f -print0 | xargs -0 stat -c "%y %s %n"
  # print modification time, size and name for every regular file
find . -name "*.ext" -print0 | xargs -n 1000 -I '{}' mv '{}' ../..
  # for a number of files to high for the mv command (move to two directories up)

bfs

  • https://github.com/tavianator/bfs - a variant of the UNIX find command that operates breadth-first rather than depth-first. It is otherwise intended to be compatible with many versions of find, including POSIX find, GNU find, {Free,Open,Net}BSD find, macOS find

fd

  • https://github.com/sharkdp/fd - a simple, fast and user-friendly alternative to find. While it does not seek to mirror all of find's powerful functionality, it provides sensible (opinionated) defaults for 80% of the use cases.

hf

  • https://github.com/hugows/hf - a command line utility to quickly find files and execute a command - something like Helm/Anything/CtrlP for the terminal. It tries to find the best match, like other fuzzy finders (Sublime, ido, Helm). [34]


locate

locate fileordirectory

locate /

locate / | xargs -i echo 'test -f "{}" && echo "{}"' | sh
  # only files

locate / | xargs -i echo 'test -d "{}" && echo "{}"' | sh
  # only directories [35]

"Although in other distros locate and updatedb are in the findutils package, they are no longer present in Arch's package. To use it, install the mlocate package. mlocate is a newer implementation of the tool, but is used in exactly the same way."

mlocate is a locate/updatedb implementation. The 'm' stands for "merging": updatedb reuses the existing database to avoid rereading most of the file system, which makes updatedb faster and does not trash the system caches as much. The locate(1) utility is intended to be completely compatible to slocate. It also attempts to be compatible to GNU locate, when it does not conflict with slocate compatibility.


Before locate can be used, the database will need to be created. To do this, simply run updatedb as root.

sudo updatedb
  # creates/updates a db file of paths that is queried by locate

/etc/updatedb.conf

PRUNE_BIND_MOUNTS = "yes"
PRUNEFS = "9p afs anon_inodefs auto autofs bdev binfmt_misc cgroup cifs coda configfs cpuset cramfs debugfs devpts devtmpfs ecryptfs exofs ftpfs fuse fuse.encfs fuse.sshfs fusectl gfs gfs2 hugetlbfs inotifyfs iso9660 jffs2 lustre mqueue ncpfs nfs nfs4 nfsd pipefs proc ramfs rootfs rpc_pipefs securityfs selinuxfs sfs shfs smbfs sockfs sshfs sysfs tmpfs ubifs udf usbfs vboxsf"
PRUNENAMES = ".git .hg .svn"
PRUNEPATHS = "/afs /media /mnt /net /sfs /tmp /udev /var/cache /var/lib/pacman/local /var/lock /var/run /var/spool /var/tmp"

/var/lib/mlocate/mlocate.db


strings /var/lib/mlocate/mlocate.db | grep -E '^/.*config'
  # query db directly, needs sudo or sudoers or acl

fsearch


fselect

to sort


Diffing files

File and directory comparison.
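
The baseline tools are diff's file and recursive modes:

diff -u file1 file2
  # unified diff of two files
diff -ru dir1 dir2
  # recursively diff two directory trees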




  • diffoscope - in-depth comparison of files, archives, and directories. diffoscope will try to get to the bottom of what makes files or directories different. It will recursively unpack archives of many kinds and transform various binary formats into more human readable form to compare them. It can compare two tarballs, ISO images, or PDF just as easily.


  • https://github.com/renardchien/Differnator - a tool for comparing two similar codebases leveraging diff command operations and other comparative operations. Differnator can do large scale comparisons across several directories or numerous pairs of similar codebases.


GUI





  • xxdiff - a graphical file and directories comparator and merge tool, provided under the GNU GPL open source license. It has reached stable state, and is known to run on many popular unices, including IRIX, Linux, Solaris, HP/UX, DEC Tru64. It has been deployed inside many large organizations and is being actively maintained by its author (Martin Blais).


  • Diffuse - graphical tool for merging and comparing text files


Finding duplicate files

  • md5deep and hashdeep - md5deep is a set of programs to compute MD5, SHA-1, SHA-256, Tiger, or Whirlpool message digests on an arbitrary number of files. md5deep is similar to the md5sum program found in the GNU Coreutils package, but has the following additional features: Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory. Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match. Hash sets can be drawn from Encase, the National Software Reference Library, iLook Investigator, Hashkeeper, md5sum, BSD md5, and other generic hash generating programs. Users are welcome to add functionality to read other formats too! Time estimation - md5deep can produce a time estimate when it's processing very large files. Piecewise hashing - hash input files in arbitrary sized blocks. File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc. hashdeep is a program to compute, match, and audit hashsets. With traditional matching, programs report if an input file matched one in a set of knowns or if the input file did not match. It's hard to get a complete sense of the state of the input files compared to the set of knowns. It's possible to have matched files, missing files, files that have moved in the set, and to find new files not in the set. Hashdeep can report all of these conditions. It can even spot hash collisions, when an input file matches a known file in one hash algorithm but not in others. The results are displayed in an audit report.



  • dupeGuru - a cross-platform (Linux, OS X, Windows) GUI tool to find duplicate files in a system. It’s written mostly in Python 3 and has the peculiarity of using multiple GUI toolkits, all using the same core Python code. On OS X, the UI layer is written in Objective-C and uses Cocoa. On Linux & Windows, it’s written in Python and uses Qt5. dupeGuru is a tool to find duplicate files on your computer. It can scan either filenames or contents. The filename scan features a fuzzy matching algorithm that can find duplicate filenames even when they are not exactly the same. dupeGuru runs on Mac OS X and Linux. dupeGuru is efficient. Find your duplicate files in minutes, thanks to its quick fuzzy matching algorithm. dupeGuru not only finds filenames that are the same, but it also finds similar filenames.



  • https://github.com/jbruchon/jdupes - A powerful duplicate file finder and an enhanced fork of 'fdupes'. What jdupes is not: a similar (but not identical) file finding tool.



  • Duff - a Unix command-line utility for quickly finding duplicates in a given set of files. Duff is written in C, uses gettext where available, is licensed under the zlib/libpng license and should compile on most modern Unices.




  • https://github.com/IgnorantGuru/rmdupe - uses standard linux commands to search within specified folders for duplicate files, regardless of filename or extension. Before duplicate candidates are removed they are compared byte-for-byte. rmdupe can also check duplicates against one or more reference folders, can trash files instead of removing them, allows for a custom removal command, and can limit its search to files of specified size. rmdupe includes a simulation mode which reports what will be done for a given command without actually removing any files.





  • https://github.com/l00g33k/dirtree - a command line directory utility that works on both Windows and Linux. It scans the directory information and stores it in a file, so two computers can be scanned at different times, even in different countries, and the comparison carried out at another time and place. This allows terabytes of disk with millions…





Archiving

ar

  • https://en.wikipedia.org/wiki/ar_(Unix) - a Unix utility that maintains groups of files as a single archive file. Today, ar is generally used only to create and update static library files that the link editor or linker uses and for generating .deb packages for the Debian family; it can be used to create archives for any purpose, but has been largely replaced by tar for purposes other than static libraries. An implementation of ar is included as one of the GNU Binutils.


shar

  • http://en.wikipedia.org/wiki/shar - an abbreviation of shell archive, is an archive format. A shar file is a shell script, and executing it will recreate the files. This is a type of self-extracting archive file. It can be created with the Unix shar utility. To extract the files, only the standard Unix Bourne shell sh is usually required. Note that shar is not specified by the Single Unix Specification, so it is not formally a component of Unix, but a legacy utility.

unshar programs have been written for other operating systems but are not always reliable; shar files are shell scripts and can theoretically do anything that a shell script can do (including using incompatible features of enhanced or workalike shells), limiting their utility outside the Unix world.

The drawback of self-extracting shell scripts (any kind, not just shar) is that they rely on a particular implementation of programs; this affects, for example, shell archives created with older versions of makeself.
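
Creating and unpacking a shell archive with GNU sharutils (filenames illustrative):

shar *.c > sources.shar
  # pack the files into a self-extracting shell script
sh sources.shar
  # unpack; unshar(1) does the same and strips any mail headers first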

GNU paxutils

cpio
  • https://en.wikipedia.org/wiki/cpio - The cpio archive format has several basic limitations: It does not store user and group names, only numbers. As a result, it cannot be reliably used to transfer files between systems with dissimilar user and group numbering.
cd /old-dir
find . -name '*.mov' -print | cpio -pvdumB /new-dir [37]
tar
-z: Compress archive using gzip program
-c: Create archive
-v: Verbose, i.e. display progress while creating archive
-f: Archive File name
tar -zcvf archive-name.tar.gz directory-name


tar -cjf foo.tar.bz2 bar/
  create bzipped tar archive of the directory bar called foo.tar.bz2
tar -xvf foo.tar
  verbosely extract foo.tar
tar -xzf foo.tar.gz
  extract gzipped foo.tar.gz

tar -xjf foo.tar.bz2 -C bar/
  extract bzipped foo.tar.bz2 after changing directory to bar
tar -xzf foo.tar.gz blah.txt
  extract the file blah.txt from foo.tar.gz



pax
  • pax - will read, write, and list the members of an archive file, and will copy directory hierarchies. pax operation is independent of the specific archive format, and supports a wide variety of different archive formats. A list of supported archive formats can be found under the description of the -x option. [39]
mkdir newdir
cd olddir
pax -rw . newdir


pak
  • PAK File - an archive used by video games such as Quake, Hexen, Crysis, Far Cry, Half-Life, and Exient XGS Engine games. It may include graphics, objects, textures, sounds, and other game data "packed" into a single file. PAK files are often just a renamed .ZIP file.


  • https://wiki.arx-libertatis.org/PAK_file_format - very simple archive format with a file table that contains offsets to the file data, which can be stored at arbitrary positions and in arbitrary order but must be contiguous for each individual file in the archive.



libarchive


Afio

  • Afio - makes cpio-format archives. It deals somewhat gracefully with input data corruption, supports multi-volume archives during interactive operation, and can make compressed archives that are much safer than compressed tar or cpio archives. Afio is best used as an `archive engine' in a backup script.

makeself

Trash can


Caches

Storage space

CLI tools

du (disk usage)

du -sh
  size of a folder
du -S
  size of files in a folder

du -aB1m|awk '$1 >= 100'
  everything over 100Mb
cd / && sudo du -khs *
  show root folder size

sudo du -a --max-depth=1 /usr/lib | sort -n -r | head -n 20
  size of program folders /usr/lib

du -sk ./* | sort -nr | awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} { total = total + $1;
x = $1; y = 1;  while( x > 1024 ) { x = (x + 1023)/1024; y++; }
printf("%g%s\t%s\n",int(x*10)/10,pref[y],$2); } END { y = 1; while( total > 1024 )
{ total = (total + 1023)/1024; y++; } printf("Total: %g%s\n",int(total*10)/10,pref[y]); }'


diskus

  • https://github.com/sharkdp/diskus - a very simple program that computes the total size of the current directory. It is a parallelized version of du -sh. On my 8-core laptop, it is about ten times faster than du with a cold disk cache and more than three times faster with a warm disk cache. [40]


dust


df

  • df - report file system disk space usage
    • http://en.wikipedia.org/wiki/Df_(Unix) - a standard Unix command used to display the amount of available disk space for file systems on which the invoking user has appropriate read access. df is typically implemented using the statfs or statvfs system calls.
df -h
  human readable

di

  • di - a disk information utility, displaying everything (and more) that your 'df' command does. It features the ability to display your disk usage in whatever format you prefer. It also checks the user and group quotas, so that the user sees the space available for their use, not the system wide disk space.

dfc

  • https://github.com/Rolinh/dfc - a tool to report file system space usage information. When the output is a terminal, it uses color and graphs by default. It has a lot of features such as HTML, JSON and CSV export, multiple filtering options, the ability to show mount options and so on.


dv

  • https://github.com/ARM-DOE/dv - a command line tool for visualizing disk data usage on systems that don't have access to a graphical environment. After scanning a directory, dv generates an interactive webpage that displays useful information about the disk's contents. This webpage is a sunburst partition layout, where each subdirectory is displayed as a portion of the scanned directory. Hovering over a directory shows its size and percentage of the whole. dv is multiprocessed, and can be used to quickly visualize what is taking up space on a system.

dfrs

pdu

TUI

tdu

cdu

  • cdu - Color du, a perl script which calls du and displays a pretty histogram with optional colors, allowing you to immediately see the directories which take up disk space.


ncdu

  • ncdu - ncurses disk usage
ncdu / --exclude /home --exclude /media --exclude /run/media
  check everything apart from home and external drives

ncdu / --exclude /media --exclude /run/media
  check everything apart from external drives
ncdu / --exclude /home --exclude /media --exclude /run/media --exclude /boot
--exclude /tmp --exclude /dev --exclude /proc
  just the root partition


godu

gdu

Broot

ohmu

duff

GUI

Baobab

  • Baobab - aka Disk Usage Analyzer, is a graphical, menu-driven viewer that you can use to view and monitor your disk usage and folder structure. It is part of every GNOME desktop.

QDirStat

  • https://github.com/shundhammer/qdirstat - a graphical application to show where your disk space has gone and to help you to clean it up. This is a Qt-only port of the old Qt3/KDE3-based KDirStat, now based on the latest Qt 5. It does not need any KDE libs or infrastructure. It runs on every X11-based desktop on Linux, BSD and other Unix-like systems.

Filelight

  • Filelight - an application to visualize the disk usage on your computer. It creates an interactive map of concentric, segmented rings that help visualise disk usage. Features: scan local, remote or removable disks; configurable color schemes; file system navigation by mouse clicks; information about files and directories on hovering; files and directories can be copied or removed directly from the context menu; integration into Konqueror and Krusader.

xdiskusage

  • xdiskusage - a user-friendly program to show you what is using up all your disk space. It is based on the design of xdu written by Phillip C. Dykstra <dykstra at ieee dot org>. Changes have been made so it runs "du" for you, and can display the free space left on the disk, and produce a PostScript version of the display.


KDiskFree

  • KDiskFree - displays the available file devices (hard drive partitions, floppy and CD drives, etc.) along with information on their capacity, free space, type and mount point. It also allows you to mount and unmount drives and view them in a file manager.

Multi

Duc

  • Duc - a collection of tools for indexing, inspecting and visualizing disk usage. Duc maintains a database of accumulated sizes of directories of the file system, and allows you to query this database with some tools, or create fancy graphs showing you where your bytes are.

Web

diskover

  • diskover - File system crawler, storage search engine and storage analytics software powered by Elasticsearch to help visualize and manage your disk space usage.

Tagging

  • TMSU - a tool for tagging your files. It provides a simple command-line tool for applying tags and a virtual filesystem so that you can get a tag-based view of your files from within any other program. TMSU does not alter your files in any way: they remain unchanged on disk, or on the network, wherever you put them. TMSU maintains its own database and you simply gain an additional view, which you can mount, based upon the tags you set up. The only commitment required is your time and there's absolutely no lock-in.
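
A minimal session, per the TMSU documentation (file and tag names illustrative):

tmsu tag report.pdf work urgent
  # apply the tags "work" and "urgent" to a file
tmsu files work
  # list the files tagged "work"
tmsu mount ~/tags
  # mount the tag-based virtual filesystem view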