Any suggestions?
Use a streaming upload, as the docs put it:
Requests supports streaming uploads, which allow you to send large streams or files without reading them into memory. To stream and upload, simply provide a file-like object for your body:
import requests

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)
When you pass the files arg, the requests lib makes a multipart form upload, i.e. it is like submitting a form where the file is passed as a named field (file in your example).
I suspect the problem you saw is that when you pass a file object as the data arg, as suggested in the docs here https://requests.readthedocs.io/en/latest/user/advanced/#streaming-uploads, requests does a streaming upload, but the file content is used as the whole HTTP POST body.
So I think the server at the other end is expecting a form with a file field, but we're just sending the binary content of the file by itself.
What we need is some way to wrap the content of the file with the right "envelope" as we send it to the server, so that it can recognise the data we are sending.
See this issue where others have noted the same problem: https://github.com/psf/requests/issues/1584
I think the best suggestion from there is to use this additional lib, which provides streaming multipart form file upload: https://github.com/requests/toolbelt#multipartform-data-encoder
For example:
from requests_toolbelt import MultipartEncoder
import requests
encoder = MultipartEncoder(
    fields={'file': ('myfilename.xyz', open(path, 'rb'), 'text/plain')}
)
response = requests.post(
    url, data=encoder, headers={'Content-Type': encoder.content_type}
)
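Note that this leaves the file handle open after the request; in practice you may want to open the file in a with block (with open(path, 'rb') as f:) and pass f in the fields tuple, so it is closed once the upload completes.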
Reading through the mailing list thread linked to by systempuntoout, I found a clue towards the solution.
The mmap module allows you to open a file so that it acts like a string. Parts of the file are loaded into memory on demand.
Here's the code I'm using now:
import urllib2
import mmap
# Open the file as a memory mapped string. Looks like a string, but
# actually accesses the file behind the scenes.
f = open('somelargefile.zip','rb')
mmapped_file_as_string = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# Do the request
request = urllib2.Request(url, mmapped_file_as_string)
request.add_header("Content-Type", "application/zip")
response = urllib2.urlopen(request)
#close everything
mmapped_file_as_string.close()
f.close()
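This works because an mmap object supports both len() (so urllib2 can set the Content-Length header) and read() (so httplib can stream it in blocks), while the underlying file is only paged into memory as it is read.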
The documentation doesn't say you can do this, but the code in urllib2 (and httplib) accepts any object with a read() method as data. So using an open file seems to do the trick.
You'll need to set the Content-Length header yourself. If it's not set, urllib2 will call len() on the data, which file objects don't support.
import os.path
import urllib2

data = open(filename, 'rb')
headers = {'Content-Length': os.path.getsize(filename)}
# Headers must go through a Request object; urlopen's third
# positional argument is a timeout, not a header dict.
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
This is the relevant code that handles the data you supply. It's from the HTTPConnection class in httplib.py in Python 2.7:
def send(self, data):
    """Send `data' to the server."""
    if self.sock is None:
        if self.auto_open:
            self.connect()
        else:
            raise NotConnected()

    if self.debuglevel > 0:
        print "send:", repr(data)
    blocksize = 8192
    if hasattr(data, 'read') and not isinstance(data, array):
        if self.debuglevel > 0: print "sendIng a read()able"
        datablock = data.read(blocksize)
        while datablock:
            self.sock.sendall(datablock)
            datablock = data.read(blocksize)
    else:
        self.sock.sendall(data)
Using an open file object as the data parameter ensures that requests will stream the data for you.
If the file size can be determined (via the OS filesystem), the file object is streamed using an 8 KB buffer. If no file size can be determined, a Transfer-Encoding: chunked request is sent, with the data read per line instead (the object is used as an iterable).
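For illustration, a minimal sketch of that plain streaming upload (the file name and URL here are placeholders):
import requests

# Sketch: 'large-payload.bin' and the URL are placeholder values.
with open('large-payload.bin', 'rb') as f:
    # Passing the open file as data= makes requests read it in blocks;
    # Content-Length comes from the file size on disk, so the whole
    # file is never held in memory at once.
    r = requests.post('http://example.com/upload', data=f)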
If you were to use the files= parameter for a multipart POST, on the other hand, the file would be loaded into memory before sending. Use the requests-toolbelt package to stream multi-part uploads:
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder
csvfile = '/path/file.csv'
with open(csvfile, 'rb') as f:
    m = MultipartEncoder(fields={'csv_field_name': ('file.csv', f, 'text/csv')})
    headers = {'Content-Type': m.content_type}
    r = requests.post(url, data=m, headers=headers)
This will not load the entire file into memory; it will be split into chunks and transmitted a little at a time. You can see this in the requests-toolbelt source code.
The missing piece is to turn on streaming with stream=True; that's what tells Requests not to read the whole content into memory before you have a chance to look at it.
With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                # If the response is chunk-encoded, uncomment the if
                # below and set chunk_size=None to filter out
                # keep-alive chunks.
                # if chunk:
                f.write(chunk)
    return local_filename
Note that the number of bytes returned by iter_content is not exactly the chunk_size; it varies from iteration to iteration and can be considerably larger.
See body-content-workflow and Response.iter_content for further reference.
It's much easier if you use Response.raw and shutil.copyfileobj():
import requests
import shutil
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename
This streams the file to disk without using excessive memory, and the code is simple.
Note: According to the documentation, Response.raw will not decode gzip and deflate transfer-encodings, so you will need to do this manually.
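If you do need the body decoded, one possible workaround (a sketch, not part of the answer above) is to ask urllib3 to decode the stream as copyfileobj reads from it:
import functools
import shutil
import requests

def download_file_decoded(url, local_filename):
    with requests.get(url, stream=True) as r:
        # Wrap r.raw.read so urllib3 decodes gzip/deflate content
        # as it is pulled from the stream.
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename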