The problem is that you convert the result of ThreadPoolExecutor.map to a list. map returns a generator: if you don't build a list and instead iterate over the generator directly, the results are still yielded in the original order, but the loop can begin before all results are ready. You can test this with this example:
import time
import concurrent.futures
e = concurrent.futures.ThreadPoolExecutor(4)
s = range(10)
for i in e.map(time.sleep, s):
    print(i)
The reason the order is kept is probably that it's sometimes important to get results in the same order you passed them to map. Results are probably not wrapped in Future objects because in some situations it would simply take too long to do another pass over the list to collect them all. And in most cases it's very likely that the next value is ready before the loop has finished processing the previous one. This is demonstrated in this example:
import concurrent.futures
executor = concurrent.futures.ThreadPoolExecutor() # Or ProcessPoolExecutor
data = some_huge_list()
results = executor.map(crunch_number, data)
finals = []
for value in results:
    finals.append(do_some_stuff(value))
In this example it's likely that do_some_stuff takes longer than crunch_number, and if that is really the case, the loss of performance is small while you keep the easy usage of map.
Also, since the worker threads (or processes) start processing at the beginning of the list you submitted and work their way to the end, the results should usually be finished in the order they're yielded by the iterator. That means executor.map is just fine in most cases. But in some cases, for example when it doesn't matter in which order you process the values and the function you passed to map takes very different amounts of time to run, concurrent.futures.as_completed may be faster.
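A small runnable sketch contrasting the two (the exact completion order with as_completed isn't guaranteed, but with three workers the shortest sleep usually finishes first):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(delay):
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as e:
    # map yields results in submission order, waiting on each in turn
    ordered = list(e.map(work, delays))
    # as_completed yields each future as soon as it finishes
    futures = [e.submit(work, d) for d in delays]
    completed = [f.result() for f in as_completed(futures)]

print(ordered)    # [0.3, 0.1, 0.2]
print(completed)  # typically [0.1, 0.2, 0.3]
```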
Hi,
As the title suggests I want to map each item in a Pandas data frame to a function and the function then performs an operation on the given item from the data frame. Here is my code so far
def product_check(item, function_lock):
    if not shop.product_exists(item.sku):
        with function_lock:
            category_id = shop.category_check(item.category)
            time.sleep(3)
            shop.create_product(item, category_id)
def start(user):
    items = [*some data frame*]
    executor = concurrent.futures.ProcessPoolExecutor()
    manager = multiprocessing.Manager()
    function_lock = manager.Lock()
    with executor:
        executor.map(product_check, (items, function_lock,))
However, the with executor part works, but when it gets to the executor.map part the product_check function never seems to run. Any ideas what could be the issue? Thanks
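A likely cause (not confirmed in the thread): map iterates over the single tuple (items, function_lock), so it submits product_check(items) and product_check(function_lock), each missing its second argument; and because the generator returned by map is never consumed, the resulting TypeError is silently swallowed. A minimal sketch of the multiple-iterables fix, using a ThreadPoolExecutor and a stand-in function since shop and the data frame aren't available here:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def product_check(item, function_lock):  # stand-in for the poster's function
    with function_lock:
        return item * 2  # placeholder for the real shop calls

items = [1, 2, 3]
lock = threading.Lock()

with ThreadPoolExecutor() as executor:
    # map zips its iterables: each call gets one item plus the (repeated) lock
    results = list(executor.map(product_check, items, repeat(lock)))

print(results)  # [2, 4, 6]
```

Consuming the results (here via list) also surfaces any exception raised inside the workers instead of hiding it.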
Below is an example of .submit() vs .map(). Both accept the jobs immediately (submitted|mapped - start), and both take the same time to complete, 11 seconds (last result time - start). However, .submit() yields results as soon as any thread in the ThreadPoolExecutor (max_workers=2) completes (unordered!), while .map() yields results in the order they were submitted.
import time
import concurrent.futures

def worker(i):
    time.sleep(i)
    return i, time.time()

e = concurrent.futures.ThreadPoolExecutor(2)
arrIn = list(range(1, 7))[::-1]
print(arrIn)

f = []
print('start submit', time.time())
for i in arrIn:
    f.append(e.submit(worker, i))
print('submitted', time.time())
for r in concurrent.futures.as_completed(f):
    print(r.result(), time.time())
print()

print('start map', time.time())
f = e.map(worker, arrIn)
print('mapped', time.time())
for r in f:
    print(r, time.time())
Output:
[6, 5, 4, 3, 2, 1]
start submit 1543473934.47
submitted 1543473934.47
(5, 1543473939.473743) 1543473939.47
(6, 1543473940.471591) 1543473940.47
(3, 1543473943.473639) 1543473943.47
(4, 1543473943.474192) 1543473943.47
(1, 1543473944.474617) 1543473944.47
(2, 1543473945.477609) 1543473945.48
start map 1543473945.48
mapped 1543473945.48
(6, 1543473951.483908) 1543473951.48
(5, 1543473950.484109) 1543473951.48
(4, 1543473954.48858) 1543473954.49
(3, 1543473954.488384) 1543473954.49
(2, 1543473956.493789) 1543473956.49
(1, 1543473955.493888) 1543473956.49
Here is the map version of your existing code. Note that the callback now accepts a tuple as its parameter. I added a try/except in the callback so that iterating the results will not raise an error. The results are ordered according to the input list.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://www.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(tt):  # (url, timeout)
    url, timeout = tt
    try:
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return (url, conn.read())
    except Exception as ex:
        print("Error:", url, ex)
        return (url, "")  # error, return empty string

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(load_url, [(u, 60) for u in URLS])  # pass url and timeout as tuple to callback
    executor.shutdown(wait=True)  # wait for all to complete
print("Results:")
for r in results:  # ordered results, will throw exception here if not handled in callback
    print(' %r page is %d bytes' % (r[0], len(r[1])))
Output
Error: http://www.wsj.com/ HTTP Error 404: Not Found
Results:
'http://www.foxnews.com/' page is 320028 bytes
'http://www.cnn.com/' page is 1144916 bytes
'http://www.wsj.com/' page is 0 bytes
'http://www.bbc.co.uk/' page is 279418 bytes
'http://some-made-up-domain.com/' page is 64668 bytes
Without using the map method, you can use enumerate to build the future_to_url dict with not just the URLs as values but also their indices in the list. You can then build a second dict, keyed by index, from the future objects returned by concurrent.futures.as_completed(future_to_url), so that iterating the indices in order reads the results in the same order as the corresponding items in the original list:
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {
        executor.submit(load_url, url, 60): (i, url) for i, url in enumerate(URLS)
    }
    futures = {}
    for future in concurrent.futures.as_completed(future_to_url):
        i, url = future_to_url[future]
        futures[i] = url, future
    for i in range(len(futures)):
        url, future = futures[i]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
If you have an iterable of sites that you want to map over and you want to pass the same search_term and pages arguments to each call, you can take advantage of the fact that Executor.map accepts multiple iterables and calls the function with one element from each; use itertools.repeat for the constant arguments:

from itertools import repeat

def func(site, search_term, pages):
    ...

executor.map(func, sites, repeat(search_term), repeat(pages))
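Here is a self-contained, runnable sketch of passing per-item and constant arguments to executor.map; the site names and the func body are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def func(site, search_term, pages):  # hypothetical search function
    return '%s?q=%s&pages=%d' % (site, search_term, pages)

sites = ['http://a.example', 'http://b.example']

with ThreadPoolExecutor() as executor:
    # each call receives one site plus the repeated constant arguments
    results = list(executor.map(func, sites, repeat('python'), repeat(3)))

print(results)
# ['http://a.example?q=python&pages=3', 'http://b.example?q=python&pages=3']
```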
Here is a useful way of using kwargs in executor.map: wrap the call in a lambda that unpacks each dict with the **kwargs notation. (Note this works with ThreadPoolExecutor; a ProcessPoolExecutor has to pickle the callable, and lambdas can't be pickled, so for a process pool you'd use a module-level wrapper function instead.)

from concurrent.futures import ThreadPoolExecutor

def func(arg1, arg2, ...):
    ....

items = [
    {'arg1': 0, 'arg2': 3},
    {'arg1': 1, 'arg2': 4},
    {'arg1': 2, 'arg2': 5}
]

with ThreadPoolExecutor() as executor:
    result = executor.map(lambda kwargs: func(**kwargs), items)
I found this also useful with Pandas DataFrames, by creating the items with items = df.to_dict(orient='records'), or when loading data from JSON files.
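A runnable version of the kwargs pattern, with plain dicts standing in for the DataFrame records:

```python
from concurrent.futures import ThreadPoolExecutor

def func(arg1, arg2):
    return arg1 + arg2

items = [
    {'arg1': 0, 'arg2': 3},
    {'arg1': 1, 'arg2': 4},
    {'arg1': 2, 'arg2': 5},
]

with ThreadPoolExecutor() as executor:
    # the lambda unpacks each dict into keyword arguments for func
    result = list(executor.map(lambda kwargs: func(**kwargs), items))

print(result)  # [3, 5, 7]
```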