Use Series.str.split and select the first values of the lists by indexing:
df = pd.DataFrame({'col':['45+2','98+3','90+5']})
df['new'] = df['col'].str.split('+').str[0]
print (df)
col new
0 45+2 45
1 98+3 98
2 90+5 90
Or use Series.str.extract to get the first integers from the values:
df['new'] = df['col'].str.extract(r'(\d+)')
print (df)
col new
0 45+2 45
1 98+3 98
2 90+5 90
You can use a lambda function to do this.
df1 = pd.DataFrame(data=['45+2', '98+3', '90+5'], columns=['col'])
print(df1)
col
0 45+2
1 98+3
2 90+5
Delete the unwanted parts from the strings in the "col" column:
df1['col'] = df1['col'].map(lambda x: x.split('+')[0])
print(df1)
col
0 45
1 98
2 90
Can I ask why not just do it by slicing the data frame? Something like:
# create some data with a Names column
import numpy as np
import pandas as pd

data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16),
                     'Ob2': np.random.rand(16)})
# create a unique list of names
UniqueNames = data.Names.unique()

# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto, you have a dictionary of data frames, just as (I think) you want them. Need to access one? Just enter:
DataFrameDict['Joe']
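The same dictionary can also be built in a single dict comprehension; a minimal sketch, assuming the data frame created above:

# one pass: map each unique name to the matching slice of the frame
DataFrameDict = {name: data[data.Names == name] for name in data.Names.unique()}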
Firstly, your approach is inefficient because appending to the list row by row will be slow: the list has to be grown periodically when there is insufficient space for a new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think your approach is fundamentally a little wasteful: you already have a DataFrame, so why create a new one for each of these users?
I would sort the DataFrame by column 'name', set the index to be this and, if required, not drop the column.
Then generate a list of all the unique entries; you can then perform a lookup using these entries, and crucially, if you are only querying the data, use the selection criteria to return a view on the DataFrame without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
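Since the index was set to 'name', the lookup described above can also go through the index directly instead of a boolean mask; a minimal sketch, assuming the df and names variables from the snippet above:

# select every row whose index label is 'joe' (no scan of the 'name' column needed)
joe = df.loc['joe']

# or build the per-name lookup for all unique names at once
by_name = {n: df.loc[n] for n in names}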
Use np.array_split:
Docstring:
Split an array into multiple sub-arrays.
Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [4]: print(df)
A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468
In [5]: np.array_split(df, 3)
Out[5]:
[ A B C D
0 foo one -0.174067 -0.608579
1 bar one -0.860386 -1.210518
2 foo two 0.614102 1.689837,
A B C D
3 bar three -0.284792 -1.071160
4 foo two 0.843610 0.803712
5 bar two -1.514722 0.870861,
A B C D
6 foo one 0.131529 -0.968151
7 foo three -1.002946 -0.257468]
I wanted to do the same, but I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this can help!
# input  - df: a DataFrame, chunk_size: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size=10000):
    chunks = list()
    # round up so a partial final chunk is kept, without adding an empty one
    num_chunks = (len(df) + chunk_size - 1) // chunk_size
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
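A quick usage sketch, assuming a DataFrame df already exists:

# split df into pieces of at most 3 rows each
pieces = split_dataframe(df, chunk_size=3)
print(len(pieces), [len(p) for p in pieces])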
Use iloc:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(iloc docs)
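To make the positions concrete on a small frame (datasX above stands for the original DataFrame, and 72 is just the column position where it is cut), a minimal illustrative sketch with a 4-column frame split at position 2:

import numpy as np
import pandas as pd

datasX = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('wxyz'))

df1 = datasX.iloc[:, :2]   # all rows, columns before position 2 (w, x)
df2 = datasX.iloc[:, 2:]   # all rows, columns from position 2 onwards (y, z)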
Use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible - let's split the original DF into 3 DFs at the columns with indices [2, 3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974
TL;DR version:
For the simple case of:
- I have a text column with a delimiter and I want two columns
The simplest solution is:
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
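For instance, a minimal sketch of that one-liner on made-up data (the column names and values here are just for illustration):

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1 B1', 'A2 B2', 'A3']})
>>> df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
>>> df
      AB   A     B
0  A1 B1  A1    B1
1  A2 B2  A2    B2
2     A3  A3  None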
In detail:
Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.
But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough¹. It operates on a column (Series) of strings, and returns a column (Series) of lists:
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
AB
0 A1-B1
1 A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
AB AB_split
0 A1-B1 [A1, B1]
1 A2-B2 [A2, B2]
¹: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
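As a one-line illustration of those two parameters in plain Python (maxsplit corresponds to pandas' n):

>>> 'A1-B1-C1'.split('-', 1)   # sep='-', maxsplit=1
['A1', 'B1-C1']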
But how do you go from:
- a column containing two-element lists
to:
- two columns, each containing the respective element of the lists?
Well, we need to take a closer look at the .str attribute of a column.
It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
U
0 A
1 B
2 C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
U L
0 A a
1 B b
2 C c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0]
0 A
1 A
Name: AB, dtype: object
>>> df['AB'].str[1]
0 1
1 2
Name: AB, dtype: object
Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', n=1).str[0]
0 A1
1 A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0 B1
1 B2
Name: AB, dtype: object
Then, it's a simple matter of taking advantage of Python's tuple unpacking of iterables to do:
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
AB AB_split A B
0 A1-B1 [A1, B1] A1 B1
1 A2-B2 [A2, B2] A2 B2
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True)
0 1
0 A1 B1
1 A2 B2
So, another way of accomplishing what we wanted is to do:
>>> df = df[['AB']]
>>> df
AB
0 A1-B1
1 A2-B2
>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
AB A B
0 A1-B1 A1 B1
1 A2-B2 A2 B2
The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
AB
0 A1-B1
1 A2-B2
2 A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":
>>> df.join(
... df['AB'].str.split('-', expand=True).rename(
... columns={0:'A', 1:'B', 2:'C'}
... )
... )
AB A B C
0 A1-B1 A1 B1 None
1 A2-B2 A2 B2 None
2 A3-B3-C3 A3 B3 C3
There might be a better way, but here's one approach:
row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
fips row
0 00000 UNITED STATES
1 01000 ALABAMA
2 01001 Autauga County, AL
3 01003 Baldwin County, AL
4 01005 Barbour County, AL
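Given the expand=True behaviour shown earlier, the same result can be had without going through .tolist(); a sketch, assuming the same df with its single row column:

# 'fips' is added as a new column; 'row' is replaced by the remainder of the string
df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)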