Everything here assumes you’ve imported pandas as pd and numpy as np
import pandas as pd
import numpy as np
Creating Dataframe
df = pd.read_csv('my_csv.csv', index_col=0, skiprows=1)
df = pd.read_excel('my_excel.xls', skiprows=1)
dict = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.float64)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c']))
> a b c
0 1 2 3
1 4 5 6
Selecting Column(s)
Selecting a single column with a single set of [] will return a Series
df['a']
Using nested [[]] will allow you to select multiple columns and return a DataFrame
df[['a']]
Selecting Rows
Use Dataframe attribute .iloc to select row by index position
df.iloc[0] # returns first row of data as Series
Use Dataframe attribute .loc to select row by index name
df = pd.DataFrame(np.array([[1,2,3], [4,5,6]]),
columns=['a','b','c'],
index=['row-A','row-B'])
df.loc['row-B'] # returns the 2nd row
Conditionally select rows
df[df['a']>2] # select all rows where the value in column 'a' is greater than 2
Selecting Cells
df.iloc[0,1] # select cell at row 0 and column 1
You can also select a range of columns and rows
df.loc[0:1, 'a':'c'] # select rows 0 to 1 and columns 'a' to 'c'
Iterating Over Rows
numGreaterThanTwo = 0
for index, row in df.iterrows():
if row['a'] > 2:
numGreaterThanTwo += 1
Dropping Rows
df = df.drop([0,1], axis=0) # drop 1st and 2nd rows
Dropping Columns
df = df.drop(['col1','col2'], axis=1)
Mapping Values
dict = {1:'one': 2:'two', 3:'three'}
df['a'] = df['a'].replace(dict)
Manipulating Indexes
Reseting Indexes
df = df.reset_index()
Setting Indexes
df = df.set_index(['col1','col2'])
Setting MultiIndexes
tuples = list(zip(*[df['a'], df['b']]))
index = pd.MultiIndex.from_tuples(tuples, names=('a', 'b'))
df = df.set_index(index)
Read more about MultiIndex here
Merging Dataframes
df1 = ...
df2 = ...
# merge 2 dataframes on their indexes using inner join
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")