Convert both series to Python sets, then use the set intersection method:
set(s1).intersection(set(s2))
and then convert the result back to a list if needed.
I just noticed pandas in the tags. The result can be converted back to a Series:
pd.Series(list(set(s1).intersection(set(s2))))
Following the comments, I have changed this to a more Pythonic expression, which is shorter and easier to read:
pd.Series(list(set(s1) & set(s2)))
This should do the trick, unless the index data is also important to you, because the original index is not preserved.
I added the list(...) call to convert the set before passing it to pd.Series, because pandas does not accept a set as direct input for a Series.
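As a minimal, self-contained sketch of the round trip described above: the names `common` and `result` are mine, the sample values are only for illustration, and the `sorted` call is an extra step I added so the output order is deterministic (sets are unordered).

```python
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

# Set intersection drops duplicates and discards the original index.
common = set(s1) & set(s2)

# pd.Series does not accept a set directly, so go through a (sorted) list.
result = pd.Series(sorted(common))
print(result.tolist())
```

The resulting Series gets a fresh default index, which is why this approach is not suitable when the original index labels matter.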
Setup:
import numpy as np
import pandas as pd
s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])
Timings:
%%timeit
pd.Series(list(set(s1).intersection(set(s2))))
10000 loops, best of 3: 57.7 µs per loop
%%timeit
pd.Series(np.intersect1d(s1,s2))
1000 loops, best of 3: 659 µs per loop
%%timeit
pd.Series(np.intersect1d(s1.values,s2.values))
10000 loops, best of 3: 64.7 µs per loop
So the numpy solution can be comparable to the set solution even for small series, if one uses the values explicitly.
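The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch of how the same comparison could be reproduced in a plain script, the following uses the standard-library `timeit` module. The function names are my own, the repeat count of 1000 is arbitrary, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

s1 = pd.Series([4, 5, 6, 20, 42])
s2 = pd.Series([1, 2, 3, 5, 42])

def via_set():
    # Python set intersection, then back to a Series.
    return pd.Series(list(set(s1).intersection(set(s2))))

def via_numpy_series():
    # NumPy intersection called on the Series objects themselves.
    return pd.Series(np.intersect1d(s1, s2))

def via_numpy_values():
    # NumPy intersection called on the underlying arrays.
    return pd.Series(np.intersect1d(s1.values, s2.values))

for func in (via_set, via_numpy_series, via_numpy_values):
    # timeit returns total seconds for `number` calls.
    seconds = timeit.timeit(func, number=1000)
    print(f"{func.__name__}: {seconds / 1000 * 1e6:.1f} us per call")
```

All three functions return the same elements; only the set version leaves the order unspecified.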
Working with sets, lists and dicts in pandas is a bit problematic, because pandas works best with scalars. A list comprehension over the zipped columns does the job:
df['k'] = [x[0] & x[1] for x in zip(df['i'], df['j'])]
print(df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
Or, equivalently, with the intersection method:
df['k'] = [x[0].intersection(x[1]) for x in zip(df['i'], df['j'])]
print(df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
Solution with apply:
df['k'] = df.apply(lambda x: x['i'].intersection(x['j']), axis=1)
print(df)
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
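The snippets above assume a frame `df` that already holds sets in columns `i` and `j`. Here is a self-contained sketch, with the frame reconstructed by me from the printed output (so the construction itself is an assumption), that runs the list comprehension and the apply variant side by side:

```python
import pandas as pd

# Hypothetical data, rebuilt to match the printed frame.
df = pd.DataFrame({
    'i': [{1, 2, 3, 4}] * 4,
    'j': [{2, 3}, {1}, {4}, {3, 4}],
})

# Plain Python loop over paired sets.
df['k'] = [a & b for a, b in zip(df['i'], df['j'])]

# Same result through apply, which passes each row as a Series.
k_apply = df.apply(lambda row: row['i'].intersection(row['j']), axis=1)

print(df['k'].tolist())
print(k_apply.tolist() == df['k'].tolist())
```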
You can reproduce the set intersection using set differences. The intersection of A and B is equal to A minus the elements of A that are not in B, that is, A - (A - B). (By symmetry, you can also do it starting from B.)
So you can use the Series sub method to compute the set differences:
df['k'] = df['i'].sub(df['i'].sub(df['j']))
# df['k'] = df['j'].sub(df['j'].sub(df['i'])) # equivalent
Which gives the expected output:
df
Out[11]:
              i       j       k
0  {1, 2, 3, 4}  {2, 3}  {2, 3}
1  {1, 2, 3, 4}     {1}     {1}
2  {1, 2, 3, 4}     {4}     {4}
3  {1, 2, 3, 4}  {3, 4}  {3, 4}
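To make the identity A - (A - B) = A & B concrete, here is a small sketch that spells out the intermediate step and cross-checks it against a direct intersection. The frame is my reconstruction of the printed data, and `only_in_i` and `expected` are names I introduced for illustration.

```python
import pandas as pd

# Hypothetical data, rebuilt to match the printed frame.
df = pd.DataFrame({
    'i': [{1, 2, 3, 4}] * 4,
    'j': [{2, 3}, {1}, {4}, {3, 4}],
})

# On object columns, sub applies Python's "-" element by element,
# which for sets is the set difference.
only_in_i = df['i'].sub(df['j'])    # A - B
df['k'] = df['i'].sub(only_in_i)    # A - (A - B)

# Cross-check against the direct set intersection.
expected = [a & b for a, b in zip(df['i'], df['j'])]
print(df['k'].tolist() == expected)
```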
An Index in pandas is a NumPy array. As such, it has worse performance characteristics for set operations than a Python set, which is optimized for exactly such operations: its underlying implementation is a hash map, which reduces the time complexity of checking whether a value is in the set to O(1).
A NumPy array is optimized for fast traversal, so an operation that is a set operation in name but is actually performed in a very different way will never be as fast.
In your particular situation, the gain may be the elegance of calling a single method instead of using an expression that looks somewhat more cryptic at first glance.
The accepted answer is wrong in equating a pandas Index with a NumPy array. In reality, a pandas Index is based on a hash table, which is why it can only contain hashable objects (or is supposed to).
The pandas internal method is slower because it is not optimised for intersecting unordered Indexes. If you look at the source code as of 2024 (version 2.2), you will see that ix_a.intersection(ix_b) makes several fast-path checks and then falls back to building an indexer from ix_b to ix_a (or the other way around, I am not sure). In other words, to answer the question
"What elements do ix_a and ix_b have in common?"
it first answers the question
"Where are the elements of ix_a located in ix_b?"
which is a more difficult question and requires more work than needed.
Now, if your Indexes are ordered (their elements are monotonically increasing or decreasing), then ix_a.intersection(ix_b) will outperform Python's built-in sets (in some cases, for sure) by taking a fast path that exploits the order. I suppose pandas just traverses both arrays in a "merge-sort" fashion.
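For readers who want to try the two approaches themselves, here is a small sketch that intersects two Indexes with the built-in method and with Python sets. The sample values are my own, and the snippet only checks that both routes agree; it makes no claim about which internal code path pandas takes.

```python
import pandas as pd

ix_a = pd.Index([4, 5, 6, 20, 42])
ix_b = pd.Index([1, 2, 3, 5, 42])

# Built-in method on the Index objects.
via_index = ix_a.intersection(ix_b)

# Round trip through Python sets; sorted() only makes the order deterministic.
via_set = pd.Index(sorted(set(ix_a) & set(ix_b)))

# Both sample Indexes are monotonically increasing, i.e. the ordered case.
print(ix_a.is_monotonic_increasing, ix_b.is_monotonic_increasing)
print(sorted(via_index) == list(via_set))
```

To compare speed on your own data, wrap each route in a function and time it with `timeit`, once with sorted Indexes and once with shuffled ones.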
Using the answer here, apply it to the dataframe row by row:
df[['A', 'B', 'C']].apply(
    lambda row: list(set.intersection(*[set(row[col]) for col in row.index])),
    axis=1
)
Note that when applying a function by row, the row's index values are the original dataframe's columns.
df[['A', 'B', 'C']].apply(lambda x: list(set.intersection(*map(set, list(x)))), axis=1)
Out[1192]:
0       [2]
1    [3, 4]
2        []
dtype: object
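Neither snippet shows the input frame, so here is a self-contained sketch with hypothetical data that I chose to reproduce the output above. The helper name `row_intersection` is also mine, and it sorts the result so the order inside each list is deterministic.

```python
import pandas as pd

# Hypothetical frame whose cells hold lists.
df = pd.DataFrame({
    'A': [[1, 2], [3, 4, 5], [1]],
    'B': [[2, 3], [3, 4], [2]],
    'C': [[2], [4, 3, 9], [3]],
})

def row_intersection(row):
    # row is a Series indexed by column name; each value is a list.
    common = set.intersection(*(set(cell) for cell in row))
    return sorted(common)

result = df[['A', 'B', 'C']].apply(row_intersection, axis=1)
print(result.tolist())
```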