Wrangle the Messy Data

Figuring out how to clean the data, because no one agrees on a single common practice.

Kaelan McGurk https://github.com/kaelanmcgurk
05-20-2021

What I have to work with

I worked with a couple of different data frames. Some held call data, some personal info, and some arbitrary numbers in the system that represent status changes.

import pandas as pd
import numpy as np

FebCleanedLeads = pd.read_csv('dataSets/cleanedLeads_2_11_21.csv')
FebStatusHistory = pd.read_csv('dataSets/status_history_2_11_21.csv')
leadInfoML = pd.read_csv('dataSets/leadInfoML_2_11_21.csv')

stc = pd.read_csv('dataSets/stc.csv')
stc = stc.drop(['Unnamed: 0', 'callStarted'], axis = 1)  # drop the leftover index column and callStarted

The trouble mostly came when I worked with the table called 'leads'. This is the table that holds all the data describing the leads themselves. I wanted to look at the lead source[1], the sales rep working with the lead, the date the lead was created in the system, the lead's state, city, and zip code, the electric company the lead uses, the square footage of each solar install, and my "y" variable: the lead's status in the system.

Why I removed all the NAs

I can feel the collective cringe of every data science and statistics professor who has taught me. Yes, I understand how bad this could be, but I have my reasons![2] First of all, I am planning to make a model that generalizes a lead's status in the system. I do not need every single data point to achieve that goal. Secondly, I can't make predictions on a lead with little to no data available on them. Some of these leads only had a first name. That is not useful to my model. They have to go. Thirdly, this data set is large enough that I can drop some rows and still have a usable set of data to train a model on. Is that decent reasoning? Am I OK to proceed with the rest of my project?

Yes?

Marvelous.
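
For anyone who wants to check my third reason, a quick tally like this (a sketch, assuming leadInfoML is already loaded) shows exactly how many rows dropna() costs:

#%%
# How many rows does dropna() actually throw away?
rows_before = len(leadInfoML)
rows_after = len(leadInfoML.dropna())
dropped = rows_before - rows_after

print(f"Dropped {dropped} of {rows_before} rows ({dropped / rows_before:.1%})")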

Investigating and cleaning very strange rows

The first data set

As I have stated multiple times, the sales reps have no uniform practice when inputting data on a lead. Just looking at the state column, I had rows that said "Arkansas", "Oklahoma", "Tennesse", "AK", "ok", and "Select a State".[3] You get the picture. Zip code was even worse, because not only did some zip codes have too many or too few digits, some of them weren't numbers at all. I had inputs like "poasoejfpo", "here", and, my favorite, "????".

Needless to say, I had my work cut out for me.

Luckily, I learned how to do this kind of wrangling in one of my classes and I used a very similar method here.

#%%
##########################
# Work with the new machine learning dataset here
#
##########################

# Drop every row with at least one missing value
leadInfoML = leadInfoML.dropna()

leadInfoML = leadInfoML.assign(
    # Swap known junk zip codes for '0' and fill whatever is left over
    zip_code = lambda x: x.zip_code
        .replace('poasoejfpo', '0')
        .replace('na','0')
        .replace('????','0')
        .replace('',np.nan)
        .replace(' ', np.nan)
        .replace('none', np.nan)
        .replace('NA1234', np.nan)
        .replace('n', np.nan)
        .replace('lattitude', np.nan)
        .replace('hotmail.co', np.nan)
        .replace('here', np.nan)
        .replace('Elmore Cit', np.nan)
        .replace('AR', np.nan)
        .replace('A', np.nan)
        .replace('74118821 e', np.nan)
        .replace('72641 35.9', np.nan)
        .replace('72712yu', np.nan)
        .fillna('0'),
    # Expand state abbreviations to full names
    state = lambda x: x.state
        .replace('VA', 'Virginia')
        .replace('TN', 'Tennessee')
        .replace('OK', 'Oklahoma')
        .replace('NJ', 'New Jersey')
        .replace('MO', 'Missouri')
        .replace('CO', 'Colorado')
        .replace('AR', 'Arkansas')
        .replace('TX', 'Texas')
        .replace('Select a state', np.nan)
        .replace('KS', 'Kansas')
        .replace('IL', 'Illinois')
        .replace('MD', 'Maryland')
        .replace('CA', 'California')
        .replace('M', np.nan)
        .replace('FL', 'Florida')
        .fillna('noState'),
    # Flag missing electric companies explicitly
    electric_company = lambda x: x.electric_company
        .fillna('noCompany')
)

#%%
leadInfoML['state'] = leadInfoML['state'].str.strip()
leadInfoML['state'] = leadInfoML['state'].str.lower()

leadInfoML['electric_company'] = leadInfoML['electric_company'].str.lower()
leadInfoML['electric_company'] = leadInfoML['electric_company'].str.strip()

#%%

# Status code 18 marks a lead that became a customer
leadInfoML['isCust'] = [1 if x == 18 else 0 for x in leadInfoML['status']]

#%%
# Attach the call data, then drop a few states and the duplicate key column
combined = leadInfoML.merge(stc, left_on = 'lead_id', right_on = 'leadId')
combined = combined.query('state != "colorado" and state != "kansas" and state != "virginia"').drop(['leadId'], axis = 1)

This technique of using the assign() function with lambda functions allowed me to access each column individually and work with it however I needed. I found the value_counts() method quite useful when looking for values that did not belong in a certain column.
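
For example, a quick frequency table like this (a sketch of how I'd use it, assuming the raw column is loaded) surfaces the oddball entries fast, since the junk values sit in the long tail:

#%%
# Tally every distinct value in the state column, including NAs;
# junk like 'Select a state' and one-off typos show up immediately
print(leadInfoML['state'].value_counts(dropna = False))

# The rarest zip codes are usually the suspicious ones
print(leadInfoML['zip_code'].value_counts().tail(20))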

That is really all the cleaning I had to do for that data set. I know it doesn't seem like a lot, but I had to go through each column and figure out which points were correct, which were slightly off, and which just needed to be replaced altogether.
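
Listing every bad zip code by hand worked here because there were only a handful. If the list had been longer, a pattern check would scale better. Something like this (just a sketch, not what I actually ran) zeroes out anything that doesn't look like a 5-digit zip:

#%%
# Keep values matching a 5-digit zip pattern; everything else becomes '0'
looks_like_zip = leadInfoML['zip_code'].astype(str).str.fullmatch(r'\d{5}')
leadInfoML['zip_code'] = leadInfoML['zip_code'].where(looks_like_zip, '0')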

The second data set

I changed up my data as I was testing my machine learning models and found that the data I had been working with was not cutting it. So, I went and grabbed new data that focused heavily on the numbers I could pull from different tables in the database. (Check out We Need Data to see which tables I am talking about.)

These new data sets came with far fewer challenges and hurdles because I did most of the cleaning in the SQL query itself. Since there was little to no user input, I didn't have to deal with unusual or misplaced data. It also meant I had far fewer NAs to remove.[4] The only thing I had to do was merge the tables properly. But OH MY GOODNESS GRACIOUS does pandas make this SO DIFFICULT. Here is the link to their documentation on merging and linking two data frames: Merge, join, concatenate and compare.

Is it just me, or is that page super complicated? It might just be me, but it took me a long time to figure it out properly. The function I really needed was merge(). Once I figured that out, I was able to successfully merge the different tables together. This is all I had to do:

#%%
# Merge the new datasets so that they look better
leadsCombined = leadsAppt.merge(leadsCallLogs)  # joins on the columns the frames share
leadsData = leadsCombined.merge(leadsStatusHist)
leadsDat = leadsData.merge(stc, left_on = 'lead_id', right_on = 'leadId')  # key names differ here
leadsFinal = leadsDat.merge(leadsStatus)
leadsFinal = leadsFinal.drop(['leadId'], axis = 1)  # drop the now-duplicate key
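
One thing that finally made merge() click for me: when you don't pass on=, it joins on every column the two frames share, and the default how='inner' silently drops rows without a match. A check like this (a sketch, not from my original notebook) catches a merge that quietly eats your data:

#%%
# If a merge drops a lot of rows, the join keys probably don't line up
print(f"leadsAppt has {len(leadsAppt)} rows; after merging, {len(leadsCombined)} remain")

# merge() can also assert the key relationship you expect and raise if it's wrong
checked = leadsData.merge(stc, left_on = 'lead_id', right_on = 'leadId',
                          validate = 'many_to_one')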


#%%
########################
# Clean up the leadsData 
#   just a bit
########################

leadsFinal = leadsFinal.assign(
    # Same state clean-up as the first data set
    state = lambda x: x.state
        .replace('VA', 'Virginia')
        .replace('TN', 'Tennessee')
        .replace('OK', 'Oklahoma')
        .replace('NJ', 'New Jersey')
        .replace('MO', 'Missouri')
        .replace('CO', 'Colorado')
        .replace('AR', 'Arkansas')
        .replace('TX', 'Texas')
        .replace('Select a state', np.nan)
        .replace('KS', 'Kansas')
        .replace('IL', 'Illinois')
        .replace('MD', 'Maryland')
        .replace('CA', 'California')
        .replace('M', np.nan)
        .replace('FL', 'Florida')
        .fillna('noState') 
)

leadsFinal = leadsFinal.drop(['state'], axis = 1)  # cleaned above, but see below
leadsFinal['isCust'] = [1 if x == 18 else 0 for x in leadsFinal['status']]  # same customer flag as before

I did end up dropping the 'state' column from this new combined data set because it was messing with my models. Other than that, the cleaning for this new data set was simple enough. With cleaned data, it is now time to run different models and get to the basis of this project: can I predict which leads turn into customers?


  1. A lead source is how the lead came to investigate Shine Solar. These could be Facebook ads, Google ads, TV commercials, referrals from other people, etc.

  2. Be patient, Brother Hathaway.

  3. I don't even know where this place is!

  4. See, I told you it was going to be OK!