You've stated that you need to support "tar, bz2, zip or tar.gz". Python's tarfile module automatically handles gzip- and bzip2-compressed tar files, so there are really only two types of archive that you need to support: tar and zip. (bz2 by itself is not an archive format; it's just compression.)
You can determine whether a given file is a tar file with tarfile.is_tarfile(). This will also work on tar files compressed with gzip or bzip2 compression. Within a tar file you can determine whether a file is a directory using TarInfo.isdir() or a file with TarInfo.isfile().
Similarly, you can determine whether a file is a zip file using zipfile.is_zipfile(). Before Python 3.6, zipfile had no method to distinguish directories from normal files, but names that end with / are directories. (Python 3.6 added ZipInfo.is_dir() for this.)
So, given a file name, you can do this:
    import zipfile
    import tarfile

    filename = 'test.tgz'

    if tarfile.is_tarfile(filename):
        f = tarfile.open(filename)
        for info in f:
            if info.isdir():
                file_type = 'directory'
            elif info.isfile():
                file_type = 'file'
            else:
                file_type = 'unknown'
            print('{} is a {}'.format(info.name, file_type))
    elif zipfile.is_zipfile(filename):
        f = zipfile.ZipFile(filename)
        for name in f.namelist():
            print('{} is a {}'.format(name, 'directory' if name.endswith('/') else 'file'))
    else:
        print('{} is not an accepted archive file'.format(filename))
When run on a tar file with this structure:
    (py2)[mhawke@localhost tmp]$ tar tvfz /tmp/test.tgz
    drwxrwxr-x mhawke/mhawke 0 2016-02-29 12:38 x/
    lrwxrwxrwx mhawke/mhawke 0 2016-02-29 12:38 x/4 -> 3
    drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:14 x/3/
    drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:14 x/3/4/
    -rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:14 x/3/4/zzz
    drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:13 x/2/
    -rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/2/aa
    drwxrwxr-x mhawke/mhawke 0 2016-02-28 21:13 x/1/
    -rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/abc
    -rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/ab
    -rw-rw-r-- mhawke/mhawke 0 2016-02-28 21:13 x/1/a
The output is:
    x is a directory
    x/4 is a unknown
    x/3 is a directory
    x/3/4 is a directory
    x/3/4/zzz is a file
    x/2 is a directory
    x/2/aa is a file
    x/1 is a directory
    x/1/abc is a file
    x/1/ab is a file
    x/1/a is a file
Notice that x/4 is "unknown" because it is a symbolic link.
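If you want symbolic links reported as such rather than as "unknown", TarInfo also has issym() and islnk() methods. The following is a minimal sketch, not a definitive implementation: classify_member and build_demo_tar are hypothetical helper names, and the tiny in-memory archive only stands in for a real test.tgz.

```python
import io
import tarfile


def classify_member(info):
    """Return a short label for a tarfile.TarInfo member."""
    if info.isdir():
        return 'directory'
    if info.isfile():
        return 'file'
    if info.issym():
        return 'symlink'
    if info.islnk():
        return 'hard link'
    return 'unknown'


def build_demo_tar():
    """Build a tiny in-memory tar with a directory, a file and a symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        directory = tarfile.TarInfo('x')
        directory.type = tarfile.DIRTYPE
        tar.addfile(directory)

        data = b'hello'
        regular = tarfile.TarInfo('x/a')  # the default type is a regular file
        regular.size = len(data)
        tar.addfile(regular, io.BytesIO(data))

        link = tarfile.TarInfo('x/4')
        link.type = tarfile.SYMTYPE
        link.linkname = '3'
        tar.addfile(link)
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_tar()) as tar:
    labels = {info.name: classify_member(info) for info in tar}

for name, label in labels.items():
    print('{} is a {}'.format(name, label))
```

The order of the checks matters little here because a member has exactly one type, but testing isdir() and isfile() first keeps the common cases cheap.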
There is no easy way, with zipfile, to distinguish a symlink (or other file types) from a directory or normal file. The information is there in the ZipInfo.external_attr attribute, but it's messy to get it back out:
    import stat

    linked_file = f.filelist[1]
    # For archives created on Unix, the mode bits are stored in the
    # upper 16 bits of external_attr.
    is_symlink = stat.S_ISLNK(linked_file.external_attr >> 16)
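The external_attr check can be wrapped into a small classifier for zip entries. This is only a sketch under the assumption that the archive was written on a Unix-like system (otherwise the upper 16 bits may not hold mode bits at all); classify_zip_entry and build_demo_zip are hypothetical names, and the in-memory archive is built just to have something to inspect.

```python
import io
import stat
import zipfile


def classify_zip_entry(info):
    """Return a short label for a zipfile.ZipInfo entry.

    Assumes a Unix-created archive, where the upper 16 bits of
    external_attr hold the st_mode value.
    """
    mode = info.external_attr >> 16
    if stat.S_ISLNK(mode):
        return 'symlink'
    if info.filename.endswith('/'):
        return 'directory'
    return 'file'


def build_demo_zip():
    """Build a tiny in-memory zip with a directory, a file and a symlink."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('x/', '')
        zf.writestr('x/a', 'hello')
        link = zipfile.ZipInfo('x/4')
        link.create_system = 3  # 3 means Unix
        link.external_attr = (stat.S_IFLNK | 0o777) << 16
        zf.writestr(link, '3')  # the link target is stored as the entry data
    buf.seek(0)
    return buf


with zipfile.ZipFile(build_demo_zip()) as zf:
    zip_labels = {info.filename: classify_zip_entry(info)
                  for info in zf.infolist()}

for name, label in zip_labels.items():
    print('{} is a {}'.format(name, label))
```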
Answer from mhawke on Stack Overflow
You can use the str.endswith() method to check whether a name has the expected file-name extension:

    filenames = ['code.tar.gz', 'code2.bz2', 'code3.zip']
    fileexts = ['.tar.gz', '.bz2', '.zip']

    def check_extension():
        for name in filenames:
            for ext in fileexts:
                if name.endswith(ext):
                    print('The file: {} has the extension: {}'.format(name, ext))

    check_extension()
which outputs:
The file: code.tar.gz has the extension: .tar.gz
The file: code2.bz2 has the extension: .bz2
The file: code3.zip has the extension: .zip
You would have to keep a list of the extensions for every archive type you want to check against, and collect the file names into a list that you can loop over, but I think this would be a fairly effective way to solve your issue.
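The same check can be turned into a function that returns the matched extension instead of printing it. A minimal sketch, assuming a case-insensitive match is wanted; ARCHIVE_EXTENSIONS and archive_extension are hypothetical names, and the list of extensions is only an example. Longer extensions are listed first so that '.tar.bz2' is not reported as plain '.bz2'.

```python
# Example extensions only; extend as needed. Longer ones come first.
ARCHIVE_EXTENSIONS = ('.tar.gz', '.tar.bz2', '.tgz', '.tar', '.bz2', '.zip')


def archive_extension(name):
    """Return the first matching archive extension, or None."""
    lowered = name.lower()
    for ext in ARCHIVE_EXTENSIONS:
        if lowered.endswith(ext):
            return ext
    return None


for candidate in ('code.tar.gz', 'code2.bz2', 'code3.ZIP', 'notes.txt'):
    print(candidate, '->', archive_extension(candidate))
```

Keep in mind that this only inspects the name; it says nothing about what the file actually contains.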
Archives don't really make sense if all you're going to do is put one file into each of them.
Or did you mean to compress them, e.g. using the gzip or bz2 module?
If you really do want archives with only a single file, create a tar or ZIP object and add the file straight away, e.g. with TarFile.add.
Note that while it is very common on Unix-like operating systems to compress single files using bz2 or gzip, doing so is very uncommon on other platforms, e.g. Windows. There the recommendation would be to use ZIP files, even for single files, since they are handled well by applications (Windows Explorer and others).
To put a single file into a ZIP file do something similar to this:
    import zipfile
    # ...
    with zipfile.ZipFile(nameOfOriginalFile + ".zip", "w", zipfile.ZIP_DEFLATED) as zip_file:
        zip_file.write(nameOfOriginalFile)
Not passing ZIP_DEFLATED to ZipFile will result in an uncompressed zip file.
To compress a single file using e.g. gzip:
    import gzip
    import shutil

    with gzip.GzipFile(nameOfOriginalFile + ".gz", "w") as gz:
        # Open the source in binary mode: GzipFile only accepts bytes.
        with open(nameOfOriginalFile, "rb") as inp_file:
            shutil.copyfileobj(inp_file, gz)

The bz2 and lzma (not available for Python 2) APIs are the same: just import bz2 or lzma and use bz2.BZ2File or lzma.LZMAFile instead.
After both with blocks you can delete the original file (os.remove(file)) and move the archive file to the correct location. Alternatively, create the archive file directly in the correct location (os.path.join the target location and the archive name).
The standard library contains the zipfile module for working with .zip archives, the gzip module for working with .gz compressed files, and the bz2 module for working with .bz2 compressed files (the latter is slower, but usually yields better compression).
Python 3.3 also introduces lzma (for .xz and .lzma files), which has an even better compression ratio, but it does not seem to be backported to 2.7.
Note that a single file does not need .tar.gz; a .gz will do. That is because .tar.gz has two levels: .tar puts several files together and .gz compresses the result, and you don't need the first part if you have just one file. Zip does both things, so for a single file it is slightly less efficient than gz (they use the same compression method), but you may have some tool that understands zip files and not gz files, so there may be some reason to use it.
To create a single compressed file with gzip, bz2 or lzma, you just use the open function from the respective module and then use shutil.copyfileobj to copy the content of the source file to the archive.
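The open-plus-copyfileobj approach might look like the sketch below on Python 3.3 or later. COMPRESSORS and compress_file are hypothetical names introduced for illustration, and the temporary file exists only so the example can run on its own.

```python
import bz2
import gzip
import lzma
import os
import shutil
import tempfile

# Hypothetical mapping from a format name to (module, file extension).
COMPRESSORS = {
    'gzip': (gzip, '.gz'),
    'bz2': (bz2, '.bz2'),
    'lzma': (lzma, '.xz'),
}


def compress_file(source_path, fmt='gzip'):
    """Compress source_path next to itself and return the new path."""
    module, ext = COMPRESSORS[fmt]
    target_path = source_path + ext
    # Both files are opened in binary mode so bytes are copied unchanged.
    with open(source_path, 'rb') as src, module.open(target_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return target_path


# Round-trip check on a throwaway file.
payload = b'hello archive\n' * 100
round_trips = {}
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'example.txt')
    with open(source, 'wb') as fh:
        fh.write(payload)
    for fmt, (module, _ext) in COMPRESSORS.items():
        target = compress_file(source, fmt)
        with module.open(target, 'rb') as fh:
            round_trips[fmt] = fh.read() == payload

print(round_trips)
```

The original file is left in place; removing it afterwards is a separate step.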
The easiest way is to use shutil.make_archive. It supports both zip and tar formats.
import shutil
shutil.make_archive(output_filename, 'zip', dir_name)
If you need to do something more complicated than zipping the whole directory (such as skipping certain files), then you'll need to dig into the zipfile module as others have suggested.
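One detail that is easy to miss: make_archive expects the output name without an extension, appends the extension itself, and returns the path of the archive it created. A small self-contained sketch, using a throwaway directory tree in place of a real dir_name (the 'project' and 'backup' names are placeholders):

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder directory to archive, with one nested file.
    dir_name = os.path.join(tmp, 'project')
    os.makedirs(os.path.join(dir_name, 'sub'))
    with open(os.path.join(dir_name, 'sub', 'a.txt'), 'w') as fh:
        fh.write('hello')

    # No extension here; make_archive adds '.zip' on its own.
    output_filename = os.path.join(tmp, 'backup')
    archive_path = shutil.make_archive(output_filename, 'zip', dir_name)

    archive_basename = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archive_basename)
print(archived_names)
```

Passing 'gztar' or 'bztar' instead of 'zip' produces a compressed tar archive in the same way.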
As others have pointed out, you should use zipfile. The documentation tells you what functions are available, but doesn't really explain how you can use them to zip an entire directory. I think it's easiest to explain with some example code:
    import os
    import zipfile

    def zipdir(path, ziph):
        # ziph is zipfile handle
        for root, dirs, files in os.walk(path):
            for file in files:
                ziph.write(os.path.join(root, file),
                           os.path.relpath(os.path.join(root, file),
                                           os.path.join(path, '..')))

    with zipfile.ZipFile('Python.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipdir('tmp/', zipf)
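The same walk can be adapted to skip certain files. The variant below is a sketch only: zipdir_filtered and its skip_suffixes parameter are hypothetical additions, and the temporary 'pkg' directory is there just to make the example runnable.

```python
import os
import tempfile
import zipfile


def zipdir_filtered(path, ziph, skip_suffixes=('.pyc',)):
    """Add files under path to ziph, skipping names with the given suffixes."""
    parent = os.path.join(path, '..')
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(skip_suffixes):
                continue  # leave this file out of the archive
            full_path = os.path.join(root, file)
            # Store names relative to the parent, as zipdir does.
            ziph.write(full_path, os.path.relpath(full_path, parent))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'pkg')
    os.makedirs(src)
    for name in ('keep.py', 'skip.pyc'):
        with open(os.path.join(src, name), 'w') as fh:
            fh.write('x')

    zip_path = os.path.join(tmp, 'pkg.zip')
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipdir_filtered(src, zipf)
    with zipfile.ZipFile(zip_path) as zipf:
        filtered_names = zipf.namelist()

print(filtered_names)
```

str.endswith accepts a tuple, so several suffixes can be excluded with a single call.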