-- Find identical files -- Rapid Deployment Software 2000 - Public Domain -- Modified by R. M. Forno - 2006/10/02 -- This program can be run without change on DOS, Windows and Linux. -- By default, it scans the current directory and all subdirectories. -- You can also specify one or more directories on the command line, -- and it will only look at those directories. -- You can even scan an entire drive or multiple drives. -- You can request that no subdirectories be scanned. In order to do that, -- you should precede the specification of the directory with a minus (-) sign. -- You can also specify that paths starting with some string be excluded -- from the comparison. To do so, you should precede these paths with the -- plus (+) sign. -- Usage examples: -- exu dupfile.exw -- scan current directory and below -- ex dupfile.exw -- press 'i' to ignore any critical errors -- exw dupfile C:\ D:\ > duplist -- scan specific directories and below -- exw dupfile my_dir -- scan my_dir under current directory/drive -- exw dupfile C: -- scan entire drive C with subdirectories -- exw dupfile -. -- scan current directory only (no subs) -- exw dupfile C:\EUPHORIA +\EUPHORIA\DOC -- exclude \EUPHORIA\DOC and subs -- exw dupfile \EUPHORIA -+\EUPHORIA\DOC -- exclude only \EUPHORIA\DOC -- exw dupfile "-C:\Program Files" -- scan only C:\Program Files -- NOTE 1: if platform() is WIN32, you cannot specify a relative path (no SLASH -- ahead) for a drive that is not the current one. For example, if you sit -- in C:, D:GAMES will not work as D:\\GAMES but instead as -- D:\GAMES. This is due to a problem with the functions system() and -- current_dir() under WIN32. No fix is envisioned yet. -- NOTE 2: if platform() is LINUX, you cannot specify the root path, /, due -- to a bug in the function walk_dir(). Rob knows of it, and he promised -- to fix it in the next release. -- In case both - and + are used, they may be in any order -- You can even use -+- or --, but not +-+... that would have been -- too much work :-) -- The output shows the quoted absolute paths to the matching files. -- You can easily use this output to delete the duplicated files you want. -- To do so, you should redirect the program's output to a file named, -- for example, dup.bat. You can then edit this file in order to select -- the files you want to delete. -- How it works: -- 1. It gathers a list of all the file names and sizes -- from the directories that you specify on the command line. -- 2. It sorts the list by file size. -- 3. It compares files of equal size to see if they are identical. include file.e include get.e constant TRUE = 1, FALSE = 0, EOF = -1 constant SIZE = 1, NAME = 2 constant ESC = 27 constant FIND_LIMIT = 30 -- break point for find's search efficiency constant MEM = 100000 -- reserved RAM for reading files integer SLASH if platform() = LINUX then SLASH = '/' else SLASH = '\\' end if sequence dir_list -- list of directories to scan dir_list = {} sequence recurse -- recurse flags for dir_list recurse = {} sequence exclude_list -- list of directories to exclude exclude_list = {} sequence ex_recur -- recurse flags for exclude_list ex_recur = {} sequence file_info -- names and sizes file_info = {} sequence attribs -- file attributes attribs = {} sequence exclude -- excluded files names and sizes exclude = {} integer max_ex -- binary seek lmit for exclude list integer max_info -- binary seek limit for file_info list function GradeUp(sequence s, integer n) --This function is a merge sort of the Von Neumann kind, that is, -- first generates variable length sequences and then merges them. --While it is more complex than the merge sort provided in EUPHORIA -- demo ALLSORTS, it has some advantages: --1) According to the argument n being zero or not zero, it respectively -- gives the indexes of the sorted elements of s or the elements themselves. -- The sorted indexes are useful when one sorts by a key but wants -- to sort the remaining data, that for example resides in a "parallel" -- sequence. You can see an example of that use in the IndexOf routine -- that follows. --2) This sort is faster than most sorts in demo ALLSORTS for RANDOM data. -- Its performance is approximately equal to the stardard sort provided -- with EUPHORIA (Shell sort). However, its main advantage is being MUCH -- FASTER than most sorts for data that is nearly ordered, either in NORMAL -- or REVERSE order, or in large chunks of data ordered both ways. --3) It is stable, that is, it preserves the original order between -- equal elements. Some of the other sorts are not stable (quick_sort, -- for example). This is not important if the output consists in the -- elements themselves, but it matters if the output consists in the -- indexes. sequence z, r, w, t object low, high, temp, one, two integer len, m, k, g, h, f, a, b, mid, indz, indr, indt, indw, len1 len = length(s) if len <= 1 then if n then return s else return repeat(1, len) end if end if indz = 1 z = repeat(0, len) mid = len - 1 indr = 0 r = repeat(0, mid) --Indexes of developed data sequences t = repeat(0, len + mid) --Work area for developing sequences m = 1 while m <= len do --Develop ordered data sequences from the input indr += 1 low = s[m] high = low t[len] = m a = len b = len while m < len do m += 1 temp = s[m] if compare(temp, high) >= 0 then --Data in normal order b += 1 t[b] = m high = temp elsif compare(temp, low) < 0 then --Data in reverse order a -= 1 t[a] = m low = temp else --End of sequenced data m -= 1 exit end if end while z[indz..m] = t[a..b] --Add sequence to result m += 1 indz = m r[indr] = m end while if indr <= 1 then --If only one data sequence developed if n then for i = 1 to len do z[i] = s[z[i]] end for end if return z end if len1 = len + 1 r = r[1..indr] t = r w = z while 1 do --Start merge process indw = 0 indt = 0 k = 1 --First bound for i = 2 to indr by 2 do --Merge pairs of data sequences g = r[i - 1] h = r[i] f = g a = z[k] one = s[a] b = z[g] two = s[b] while 1 do --Merge two data sequences indw += 1 if compare(one, two) <= 0 then w[indw] = a k += 1 if k >= f then exit end if a = z[k] one = s[a] else w[indw] = b g += 1 if g >= h then exit end if b = z[g] two = s[b] end if end while --Both conditions cannot hold true at the same time if k < f then indw += 1 f -= 1 b = indw + f - k w[indw..b] = z[k..f] --Remainder of first data sequence indw = h - 1 elsif g < h then indw = h - 1 w[g..indw] = z[g..indw] --Remainder of second data sequence end if indt += 1 t[indt] = h k = h --First bound = previous last bound end for if and_bits(1, indr) then --Orphaned last data sequence w[k..len] = z[k..len] indt += 1 t[indt] = len1 end if if indt <= 1 then --If only one data sequence remains if n then for i = 1 to len do w[i] = s[w[i]] end for end if return w end if r = t[1..indt] indr = indt z = w end while end function function apply_grade(sequence s, sequence index) -- apply sorted index vector to data for i = 1 to length(index) do index[i] = s[index[i]] end for return index end function function BinSeek(object x, sequence s, integer min, integer max) --Finds index of x in s between indexes min and max. --s should be sorted. --If x is not found returns the negative of the position where it would be -- inserted (like EDS). --For performance reasons, min and max are not checked for validity, and hence -- out of range values will cause the routine to crash. integer mid, test while min <= max do mid = floor((min + max) / 2) test = compare(x, s[mid]) if test > 0 then min = mid + 1 elsif test < 0 then max = mid - 1 else return mid end if end while return - min end function function search(object item, sequence s, integer max) -- searchs item in an ordered sequence integer k if length(s) <= FIND_LIMIT then -- if find still efficient return find(item, s) else k = BinSeek(item, s, 1, max) if k > 0 then return k else return 0 end if end if end function function parse_dos(sequence s) -- process . and .. for DOS integer i, ind1, len, count, flag sequence t flag = 1 while flag do i = 1 flag = 0 len = length(s) if len > 1 and s[len] = SLASH then return "" end if t = s while i <= len do while i <= len and s[i] != SLASH do i += 1 end while if i > len then exit end if ind1 = i count = 0 i += 1 while i <= len and s[i] = '.' do count += 1 i += 1 end while if i > len or s[i] = SLASH then if count = 0 then if len = 1 then return s end if return "" elsif count > 2 then return "" else while ind1 and count do if s[ind1] = SLASH then count -= 1 end if ind1 -= 1 end while if count then return "" elsif ind1 then t = s[1..ind1] & s[i..len] else if i > len then t = {SLASH} else t = SLASH & s[i + 1..len] end if end if end if exit else i += 1 end if end while if not equal(t, s) then flag = 1 s = t end if end while return s end function function parse_linux(sequence s) -- process . and .. for Linux integer i, ind1, len, count, flag sequence t flag = 1 while flag do i = 1 flag = 0 len = length(s) t = s while i <= len do while i <= len and s[i] != SLASH do i += 1 end while if i > len then exit end if ind1 = i count = 0 i += 1 while i <= len and s[i] = '.' do count += 1 i += 1 end while if i > len or s[i] = SLASH then if count < 2 then if len = 1 then return s end if t = s[1..ind1 - 1] & s[i..len] exit elsif count = 2 then while ind1 and count do if s[ind1] = SLASH then count -= 1 end if ind1 -= 1 end while if count or ind1 = 0 then t = SLASH & s[i..len] else t = s[1..ind1] & s[i..len] end if end if exit else i += 1 end if end while if not equal(t, s) then flag = 1 s = t end if end while return s end function procedure insert(sequence c, integer recur, integer excl) -- insert item in either dir_list or exclude_list if length(c) then if excl then if not find(c, exclude_list) then -- insure no duplicates exclude_list = append(exclude_list, c) ex_recur &= recur end if else if not find(c, dir_list) then -- insure no duplicates dir_list = append(dir_list, c) recurse &= recur end if end if end if end procedure constant TOUPPER = 'A' - 'a' function upper(object c) return c + TOUPPER * (c >= 'a' and c <= 'z') end function procedure parse_cmd() -- get the list of directories to scan sequence cl, curr_dir, c, drive, new_curr, p integer recur, excl, letter, len cl = command_line() curr_dir = current_dir() drive = curr_dir[1..2] letter = drive[1] if length(cl) < 3 then dir_list = {curr_dir} recurse = {1} end if for i = 3 to length(cl) do c = cl[i] recur = 1 -- defaults excl = 0 if c[1] = '-' then -- no recursion c = c[2..length(c)] recur = 0 end if len = length(c) if len then if c[1] = '+' then -- exclusion c = c[2..len] excl = 1 len -= 1 end if end if if len then if c[1] = '-' then -- no recursion, again (order may be -+ or +-) c = c[2..len] recur = 0 len -= 1 end if end if if platform() != LINUX then c = upper(c) -- avoid differences due to letter case if len >= 2 and c[2] = ':' then -- drive letter present if len = 2 or c[3] != SLASH then -- relative path if c[1] = letter then -- current drive if len = 2 then -- no path c = curr_dir else if length(curr_dir) > 3 then -- no root dir c = curr_dir & SLASH & c[3..len] else -- root dir c = curr_dir & c[3..len] end if end if else -- another drive if platform() = DOS32 then -- doesnt work on WIN32 system(c[1..2], 2) -- try to go to the other drive new_curr = current_dir() if new_curr[1] = letter then -- failed c = drive -- skip this one else if len = 2 then -- no path c = new_curr -- normalize to full spec else c = new_curr & SLASH & c[3..len] end if system(drive, 2) -- restore to current drive end if else -- platform() = WIN32 (impossible to get curr drv) c = c[1..2] & SLASH & c[3..len] -- force absolute end if end if -- do nothing if absolute path complete (with drive letter) end if else -- no drive letter if c[1] = SLASH then -- absolute path c = drive & c -- add drive letter else -- relative path if length(curr_dir) > 3 then -- not root dir c = curr_dir & SLASH & c else -- root dir c = curr_dir & c end if end if end if p = parse_dos(c[3..length(c)]) -- normalize for . and .. if length(p) then c = c[1..2] & p else c = "" end if else c = parse_linux(c) -- normalize for . and .. end if insert(c, recur, excl) end for end procedure function look_at_ex(sequence path_name, sequence entry) -- record file info for excluded files sequence attr, item attr = entry[D_ATTRIBUTES] if find('d', attr) then return 0 -- ignore directories end if path_name &= SLASH & entry[D_NAME] item = {entry[D_SIZE], upper(path_name)} if not search(item, exclude, max_ex) then exclude = append(exclude, item) end if return 0 end function function look_at(sequence path_name, sequence entry) -- record file info for selected files sequence attr, item attr = entry[D_ATTRIBUTES] if find('d', attr) then return 0 -- ignore directories end if path_name &= SLASH & entry[D_NAME] item = {entry[D_SIZE], upper(path_name)} if not search(item, file_info, max_info) then if not search(item, exclude, max_ex) then file_info = append(file_info, item) attribs = append(attribs, attr) end if end if return 0 end function procedure gen_exclude() sequence ex integer r max_ex = 0 for i = 1 to length(exclude_list) do ex = exclude_list[i] puts(2, "Scanning for exclusions ") if ex_recur[i] = 0 then puts(2, "(no recursion) ") end if puts(2, ex & "\n\n") r = walk_dir(ex, routine_id("look_at_ex"), ex_recur[i]) exclude = GradeUp(exclude, 1) max_ex = length(exclude) end for end procedure procedure gen_file_info() sequence dir, s integer r max_info = 0 for i = 1 to length(dir_list) do dir = dir_list[i] puts(2, "Scanning ") if recurse[i] = 0 then puts(2, "(no recursion) ") end if puts(2, dir & "\n\n") r = walk_dir(dir, routine_id("look_at"), recurse[i]) s = GradeUp(file_info, 0) file_info = apply_grade(file_info, s) attribs = apply_grade(attribs, s) max_info = length(file_info) end for end procedure function same(sequence file1, sequence file2) -- return TRUE if files are identical integer f1, f2 integer c1, c2 integer s f1 = open(file1, "rb") if f1 = -1 then puts(2, "Cannot open file " & file1 & ", previously opened\n") return FALSE end if f2 = open(file2, "rb") if f2 = -1 then puts(2, "Cannot open file " & file2 & ", previously opened\n") close(f1) return FALSE end if while TRUE do c1 = getc(f1) c2 = getc(f2) if c1 != c2 then s = FALSE exit elsif c1 = EOF then s = TRUE exit end if end while close(f1) close(f2) return s end function procedure check_equals(sequence match_len, integer size) integer max_len, len, f1, k, n, flag sequence data, list, numbers, file1, ind len = length(match_len) max_len = floor(MEM / len) -- compute length to read if size < max_len then max_len = size end if list = {} numbers = {} for i = 1 to len do -- read in first data bytes k = match_len[i] file1 = file_info[k][NAME] f1 = open(file1, "rb") if f1 = -1 then puts(2, "\nCannot open file " & file1 & '\n') else data = get_bytes(f1, max_len) list = append(list, data) numbers &= k close(f1) end if end for len = length(list) if len <= 1 then -- no pairs to match return end if ind = GradeUp(list, 0) -- sort equal length files data list = apply_grade(list, ind) numbers = apply_grade(numbers, ind) if size = max_len then -- all data read for i = 1 to len - 1 do k = numbers[i] if k then data = list[i] flag = 1 -- marks 1st time for j = i + 1 to len do if equal(data, list[j]) then n = numbers[j] numbers[j] = 0 if flag then -- if 1st time printf(1, "%d bytes, identical files:\n", size) printf(1, "del \"%s\"\n", {file_info[k][NAME]}) flag = 0 end if printf(1, "del \"%s\"\n", {file_info[n][NAME]}) else exit end if end for end if end for else -- not all data read for i = 1 to len - 1 do k = numbers[i] if k then data = list[i] flag = 1 -- marks 1st time for j = i + 1 to len do if equal(data, list[j]) then n = numbers[j] if n then -- if not used yet if same(file_info[k][NAME], file_info[n][NAME]) then numbers[j] = 0 if flag then -- if 1st time printf(1, "%d bytes, identical files:\n", size) printf(1, "del \"%s\"\n", {file_info[k][NAME]}) flag = 0 end if printf(1, "del \"%s\"\n", {file_info[n][NAME]}) end if end if else exit end if end for end if end for end if end procedure procedure check_dups() -- step through the sorted list of files integer i, c, size, len sequence match_len -- file_info is already sorted by first field - size len = length(file_info) i = 1 while i < len do size = file_info[i][SIZE] match_len = {i} i += 1 while i <= len do -- select indexes of same size files if file_info[i][SIZE] = size then match_len &= i else exit end if i += 1 end while if length(match_len) > 1 then check_equals(match_len, size) end if -- check for user abort c = get_key() if c != -1 and c >= ESC and not find(c, "rif") then puts(1, "\nABORTED\n") exit end if end while end procedure procedure main() parse_cmd() gen_exclude() gen_file_info() check_dups() end procedure main() if platform() = WIN32 then puts(2, "Press Enter\n") if getc(0) then end if end if