Monthly Archives: March 2019

Pandas.Dataframe Cheat Sheet

Everything here assumes you’ve imported pandas as pd and numpy as np

import pandas as pd
import numpy as np

Creating Dataframe

df = pd.read_csv('my_csv.csv', index_col=0, skiprows=1)
df = pd.read_excel('my_excel.xls', skiprows=1)
dict = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.float64)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                    columns=['a', 'b', 'c']))
> a b c
0 1 2 3
1 4 5 6

Selecting Column(s)

Selecting a single column with a single set of [] will return a Series

df['a']

Using nested [[]] will allow you to select multiple columns and return a DataFrame

df[['a']]

Selecting Rows

Use Dataframe attribute .iloc to select row by index position

df.iloc[0] # returns first row of data as Series

Use Dataframe attribute .loc to select row by index name

df = pd.DataFrame(np.array([[1,2,3], [4,5,6]]), 
                  columns=['a','b','c'], 
                  index=['row-A','row-B'])
df.loc['row-B'] # returns the 2nd row

Conditionally select rows

df[df['a']>2] # select all rows where the value in column 'a' is greater than 2

Selecting Cells

df.iloc[0,1] # select cell at row 0 and column 1

You can also select a range of columns and rows

df.loc[0:1, 'a':'c'] # select rows 0 to 1 and columns 'a' to 'c'

Iterating Over Rows

numGreaterThanTwo = 0
for index, row in df.iterrows():
    if row['a'] > 2:
        numGreaterThanTwo += 1

Dropping Rows

df = df.drop([0,1], axis=0) # drop 1st and 2nd rows

Dropping Columns

df = df.drop(['col1','col2'], axis=1)

Mapping Values

dict = {1:'one': 2:'two', 3:'three'}
df['a'] = df['a'].replace(dict)

Manipulating Indexes

Reseting Indexes

df = df.reset_index()

Setting Indexes

df = df.set_index(['col1','col2'])

Setting MultiIndexes

tuples = list(zip(*[df['a'], df['b']]))
index = pd.MultiIndex.from_tuples(tuples, names=('a', 'b'))
df = df.set_index(index)

Read more about MultiIndex here

Merging Dataframes

df1 = ...
df2 = ...
# merge 2 dataframes on their indexes using inner join
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
Advertisements
Tagged , , ,