A full implementation of what you want can be found here:
series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a,b)))
Answer from Sebastian Mendez on Stack Overflow
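The snippet above calls a `jaccard` function that it does not define. Here is a minimal runnable sketch, assuming `jaccard` is the usual intersection-over-union of two sets; the sample `df` and its column names are made up for illustration.

```python
import pandas as pd


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed definition, not from the answer)."""
    union = len(a | b)
    # Two empty sets are treated as identical here; that is a choice, not a standard.
    return len(a & b) / union if union else 1.0


# Hypothetical data: each row becomes one set of its cell values.
df = pd.DataFrame({"x": ["a", "a", "b"], "y": ["b", "c", "c"]})

series_set = df.apply(frozenset, axis=1)
new_df = series_set.apply(lambda a: series_set.apply(lambda b: jaccard(a, b)))
print(new_df)
```

The result is a square DataFrame whose entry at row i, column j is the similarity between rows i and j of `df`.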
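You could get rid of the nested apply by vectorizing your function. First, get all pairwise combinations and pass them to a vectorized version of your function. (Note that np.vectorize is a convenience wrapper that still loops in Python, so any speedup likely comes from avoiding the nested apply rather than from true vectorization.)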
import numpy as np

def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

i = df.apply(frozenset, axis=1).to_frame()
j = i.assign(foo=1)  # dummy key so the merge below acts as a cross join
k = j.merge(j, on='foo').drop(columns='foo')
k.columns = ['A', 'B']
fnc = np.vectorize(jaccard_similarity_score)
y = fnc(k['A'], k['B']).reshape(len(df), -1)
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
[ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
[ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
[ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
[ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
[ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
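For reference, the same pairwise approach can be written without the dummy key. This is a sketch that assumes pandas 1.2 or later, where `merge` accepts `how="cross"`; the sample `df` is hypothetical and is not the data that produced the array above.

```python
import numpy as np
import pandas as pd


def jaccard_similarity_score(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Hypothetical data: each row becomes one frozenset.
df = pd.DataFrame({"x": ["a", "a", "b"], "y": ["b", "c", "c"]})

sets = df.apply(frozenset, axis=1).to_frame(name="A")
# Cross join: every row of the left frame paired with every row of the right.
pairs = sets.merge(sets.rename(columns={"A": "B"}), how="cross")

fnc = np.vectorize(jaccard_similarity_score)
y = fnc(pairs["A"], pairs["B"]).reshape(len(df), -1)
print(y)
```

The cross join emits pairs in row-major order, so reshaping to `(len(df), -1)` puts the similarity of rows i and j at `y[i, j]`.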
This is already faster, but let's see if we can get even faster.
Using senderle's fast cartesian_product:
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

i = df.apply(frozenset, axis=1).values
j = cartesian_product(i, i)
y = fnc(j[:, 0], j[:, 1]).reshape(-1, len(df))
y
array([[ 1. , 0.5, 0.5, 0.5, 0.2, 0.2],
[ 0.5, 1. , 0.5, 0.2, 0.5, 0.2],
[ 0.5, 0.5, 1. , 0.2, 0.2, 0.5],
[ 0.5, 0.2, 0.2, 1. , 0.5, 0.5],
[ 0.2, 0.5, 0.2, 0.5, 1. , 0.5],
[ 0.2, 0.2, 0.5, 0.5, 0.5, 1. ]])
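Both versions above still call the Python function once per pair. If that is the bottleneck, one alternative (a different technique from the ones in this answer, sketched here under the assumption that the set elements are hashable and sortable) is to encode each row as a 0/1 membership vector and get all intersection sizes from a single matrix product. The sample `df` and the names `vocab` and `M` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each row becomes one frozenset.
df = pd.DataFrame({"x": ["a", "a", "b"], "y": ["b", "c", "c"]})
sets = df.apply(frozenset, axis=1)

# All distinct elements, in a fixed order, define the columns of the indicator matrix.
vocab = sorted(set().union(*sets))
M = np.array([[item in s for item in vocab] for s in sets], dtype=int)

inter = M @ M.T                                   # |A & B| for every pair of rows
sizes = M.sum(axis=1)                             # |A| for every row
union = sizes[:, None] + sizes[None, :] - inter   # |A| + |B| - |A & B|
y = inter / union                                 # assumes no row is an empty set
print(y)
```

This trades memory (one column per distinct element) for speed, so it suits data with a modest number of distinct values.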
How can I convert my pandas dataframe into this format?
```
   sets items  weight  value
0  set1     a       9     10
1  set1     b      14    100
2  set2     c       5     69
3  set2     d       4    100
```
Outcome I'm looking for:
set1 = (("a", 9, 10), ("b", 14, 100))
set2 = (("c", 5, 69), ("d", 4, 100))
print(set1)
set1 = (("a", 9, 10), ("b", 14, 100))
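One possible way to get that shape is to group by the `sets` column and turn each group's remaining columns into plain tuples. This is a sketch, not a confirmed answer from the thread; it stores the results in a dictionary named `grouped` (a hypothetical name) instead of creating one variable per set.

```python
import pandas as pd

df = pd.DataFrame({
    "sets": ["set1", "set1", "set2", "set2"],
    "items": ["a", "b", "c", "d"],
    "weight": [9, 14, 5, 4],
    "value": [10, 100, 69, 100],
})

# One tuple of (items, weight, value) tuples per distinct value in "sets".
grouped = {
    name: tuple(
        group[["items", "weight", "value"]].itertuples(index=False, name=None)
    )
    for name, group in df.groupby("sets")
}

set1 = grouped["set1"]
set2 = grouped["set2"]
print(set1)
print(set2)
```

Keeping the groups in a dictionary avoids generating variable names dynamically, and `grouped["set1"]` can still be assigned to `set1` when a plain name is wanted.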
If you only need the unique values, you can just use the unique method.
If you want a Python set, use set(some_series).
In [1]: s = pd.Series([1, 2, 3, 1, 1, 4])
In [2]: s.unique()
Out[2]: array([1, 2, 3, 4])
In [3]: set(s)
Out[3]: {1, 2, 3, 4}
If you have a DataFrame, first select a Series from it (some_data_frame['<col_name>']).
For a large Series with many duplicates, set(some_series) gets slow because every element is iterated and hashed at the Python level.
It is usually faster to deduplicate in pandas first: set(some_series.unique()).
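A quick way to check this on your own data is to time both forms. The sketch below uses a made-up Series (200,000 values with only 100 distinct integers); actual timings depend on the machine, the dtype, and the pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical test data: many rows, few distinct values.
s = pd.Series(range(200_000)) % 100

via_python = set(s)           # iterates every element in Python
via_unique = set(s.unique())  # deduplicates in pandas first, then builds a small set

t_python = timeit.timeit(lambda: set(s), number=3)
t_unique = timeit.timeit(lambda: set(s.unique()), number=3)
print(f"set(s): {t_python:.4f}s  set(s.unique()): {t_unique:.4f}s")
```

Both forms produce the same set; only the amount of Python-level iteration differs.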
The DataFrame constructor does not accept a set directly (dicts are fine; you can use pd.DataFrame.from_dict(s) for those).
What you need to do is convert your set into a list and then convert that to a DataFrame:
import pandas as pd
s = {12,34,78,100}
s = list(s)
print(pd.DataFrame(s))
You can use list(s):
import pandas as pd
s = {12,34,78,100}
df = pd.DataFrame(list(s))
print(df)
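Because a set has no defined order, `list(s)` can yield the rows in any order. If a reproducible order matters, a small variation is to sort first; the column name `value` below is an assumption added for illustration.

```python
import pandas as pd

s = {12, 34, 78, 100}

# sorted() returns a list, so no separate list() call is needed.
df = pd.DataFrame(sorted(s), columns=["value"])
print(df)
```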
You should use the apply method of the DataFrame API:
df['uids'] = df.apply(lambda row: set(row['uids']), axis=1)
or
df['uids'] = df['uids'].apply(set) # great thanks to EdChum
You can find more information about the apply method here.
Examples of use
df = pd.DataFrame({'A': [[1,2,3,4,5,1,1,1], [2,3,4,2,2,2,3,3]]})
df = df['A'].apply(set)
Output:
>>> df
0    {1, 2, 3, 4, 5}
1          {2, 3, 4}
Name: A, dtype: object
Or:
>>> df = pd.DataFrame({'A': [[1,2,3,4,5,1,1,1], [2,3,4,2,2,2,3,3]]})
>>> df['A'] = df.apply(lambda row: set(row['A']), axis=1)
>>> df
                 A
0  {1, 2, 3, 4, 5}
1        {2, 3, 4}
For anyone who wants to know the fastest way to convert list into set in Pandas:
Method 1:
df['uids'] = df.apply(lambda row: set(row['uids']), axis=1)
Method 2:
df['uids'] = df['uids'].apply(set)
Method 3:
df['uids'] = df['uids'].map(set)
I ran timeit with repeat(50, 5) on a DataFrame with 4000 rows:
Method 1 - mean: 0.13299, min: 0.12723
Method 2 - mean: 0.01319, min: 0.01207
Method 3 - mean: 0.01261, min: 0.01164
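To reproduce a comparison like this, something along the following lines should work. It is a sketch with made-up data (4000 rows of short lists) and fewer repeats than the run above, so the absolute numbers will differ.

```python
import timeit

import pandas as pd

# Hypothetical data: 4000 rows, each holding a short list with duplicates.
df = pd.DataFrame({"uids": [[i, i, i + 1] for i in range(4000)]})

methods = {
    "Method 1": lambda: df.apply(lambda row: set(row["uids"]), axis=1),
    "Method 2": lambda: df["uids"].apply(set),
    "Method 3": lambda: df["uids"].map(set),
}

# Compute each result once so the three methods can be compared for equality.
results = {name: fn() for name, fn in methods.items()}

for name, fn in methods.items():
    times = timeit.repeat(fn, repeat=3, number=5)
    print(f"{name} - mean: {sum(times) / len(times):.5f}, min: {min(times):.5f}")
```

Method 1 builds a full row object for every row before touching the column, which is the likely reason it is the slowest of the three.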
Use apply:
tdf['c'] = tdf['b'].apply(list)
This is needed because calling list on the column converts the whole column at once rather than each element one by one.
Or do:
tdf['c'] = tdf['b'].map(list)
You could do:
import pandas as pd
data = [{'a': [1,2,3], 'b':{11,22,33}},{'a':[2,3,4],'b':{111,222}}]
tdf = pd.DataFrame(data)
tdf['c'] = [list(e) for e in tdf.b]
print(tdf)
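Since sets are unordered, the lists produced by `list` can come out in any order. If a stable order is wanted, one option (an addition to the answers above, assuming the set elements are sortable) is to map `sorted` instead, which also returns a list.

```python
import pandas as pd

data = [{'a': [1, 2, 3], 'b': {11, 22, 33}}, {'a': [2, 3, 4], 'b': {111, 222}}]
tdf = pd.DataFrame(data)

# sorted() turns each set into an ascending list.
tdf['c'] = tdf['b'].map(sorted)
print(tdf)
```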