15 Most-used Python Commands for Data Cleaning
Python (and especially Pandas) is one of my most used packages at work as a Marketing Analyst. Although I also use SciPy and Scikit‑learn for statistical modeling needs, we all know over 50% of an analyst’s time is spent on data cleaning, where the 15 command lines below come in handy.
1.Get current directory
import os
os.getcwd()
2. Show the first two rows of a data frame
df.head(2)
3. Convert column names to list to view
df.columns.tolist()
4. Drop NA from a column
df.A=df.A.dropna()
5. Drop 2 columns from a data frame
df.drop(['A','B'],axis=1)
6. Format date
import pandas as pd
df[‘date’]=pd.to_datetime(df[‘date’],format=”%Y%m”)
7. Count the unique number of rows in column A
df[‘A’].nunique()
8. Count the number of each value in column A
df[‘A’].value_counts()
9. Remove rows where values in df column A is in df_b column A
df=df[-df[‘A’].isin(df_b[‘A’])]
10. Create a data frame grouped by column A and count unique values of the Email column
import pandas as pd
pd.DataFrame(df.groupby([‘A’])[‘Email’].nunique())
11. Replace values in a column
replace_values = {‘Small’ : 1, ‘Medium’ : 2, ‘High’ : 3 }
df = df.replace({'A': replace_values})
12. Append 3 data frames
append_data=open.append([clicks,delivers,not_delivered])
13. Turn a data frame into a pivot table
pd.pivot_table(df, index=[‘A’],columns=’B’, values=’Emails’)
14. Flatten a pivot table
import pandas as pd
pd.melt(df, id_vars=['A'])
15. Join 2 data frames based by merging a common column
import pandas as pd
pd.merge(df1,df2,how=’Outer’,left_on=[‘domain’],right_on=[‘domain’])