Figuring out how to clean the data was a challenge, because no one agrees on a single common practice.
I worked with a couple of different data frames. Some held call data, personal info, and arbitrary numbers in the system that represent status changes.
```python
import pandas as pd
import numpy as np

FebCleanedLeads = pd.read_csv('dataSets/cleanedLeads_2_11_21.csv')
FebStatusHistory = pd.read_csv('dataSets/status_history_2_11_21.csv')
leadInfoML = pd.read_csv('dataSets/leadInfoML_2_11_21.csv')
stc = pd.read_csv('dataSets/stc.csv')
stc = stc.drop(['Unnamed: 0', 'callStarted'], axis=1)
stc
```
The trouble mostly came when I worked with the table called ‘leads’. This is the table that holds all the data describing the leads themselves. I wanted to look at the columns lead_source, the sales rep working with the lead, the date the lead was created in the system, the lead’s state, city, and zip code, the electric company the lead uses, the square footage of each solar install, and my “y” variable, the lead’s status in the system.
I can feel the collective cringe of every data science and statistics professor that has taught me. Yes, I understand how bad this could be, but I have my reasons! First of all, I am planning to make a model that will generalize a lead’s status in the system. I do not need every single data point to achieve that goal. Secondly, I can’t make predictions on a lead with little to no data available on them. Some of these leads only had a first name. That is not useful to my model. They have to go. Thirdly, this data set is large enough that I can drop some rows and still have a usable set of data to train a model on. Is that decent reasoning? Am I OK to proceed with the rest of my project?
Yes?
Marvelous.
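To make the row-dropping idea concrete, here is a minimal sketch with made-up lead records (not my actual data): `dropna(subset=...)` keeps only rows that have the fields the model needs, so a lead with nothing but a first name disappears.

```python
import pandas as pd

# Hypothetical lead records: one row carries almost no information.
leads = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Cal'],
    'state':      ['Arkansas', None, 'Oklahoma'],
    'zip_code':   ['72201', None, '74101'],
})

# Keep only rows that have the fields the model actually needs;
# the lead with just a first name gets dropped.
usable = leads.dropna(subset=['state', 'zip_code'])
print(len(usable))  # 2 rows survive
```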
As I have stated multiple times, the sales reps have no uniform practice when inputting data on a lead. Just looking in the state column I had rows that said “Arkansas”, “Oklahoma”, “Tennesse”, “AK”, “ok”, “Select a State.” You get the picture. Zip code was even worse, because not only did some zip codes have too many or too few digits, some of them weren’t even numbers. I had some inputs like “poasoejfpo”, “here”, and, my favorite, “????.”
Needless to say, I had my work cut out for me.
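One quick way to surface those bad zip codes is a regular-expression check. This is just a sketch with made-up values, not the code I used: a valid US zip code is exactly five digits, so anything that fails the pattern is junk.

```python
import pandas as pd

# Made-up sample of the kinds of zip-code entries the reps typed in.
zips = pd.Series(['72201', 'poasoejfpo', 'here', '????', '7260', '74101'])

# A valid US zip code is exactly five digits.
is_valid = zips.str.fullmatch(r'\d{5}')

# Everything that fails the pattern needs fixing or replacing.
print(zips[~is_valid].tolist())  # ['poasoejfpo', 'here', '????', '7260']
```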
Luckily, I learned how to do this kind of wrangling in one of my classes and I used a very similar method here.
```python
#%%
##########################
# Work with the new machine learning dataset here
##########################
leadInfoML = leadInfoML.dropna()

leadInfoML = leadInfoML.assign(
    zip_code = lambda x: x.zip_code
        .replace('poasoejfpo', '0')
        .replace('na', '0')
        .replace('????', '0')
        .replace('', np.nan)
        .replace(' ', np.nan)
        .replace('none', np.nan)
        .replace('NA1234', np.nan)
        .replace('n', np.nan)
        .replace('lattitude', np.nan)
        .replace('hotmail.co', np.nan)
        .replace('here', np.nan)
        .replace('Elmore Cit', np.nan)
        .replace('AR', np.nan)
        .replace('A', np.nan)
        .replace('74118821 e', np.nan)
        .replace('72641 35.9', np.nan)
        .replace('72712yu', np.nan)
        .fillna('0'),
    state = lambda x: x.state
        .replace('VA', 'Virginia')
        .replace('TN', 'Tennessee')
        .replace('OK', 'Oklahoma')
        .replace('NJ', 'New Jersey')
        .replace('MO', 'Missouri')
        .replace('CO', 'Colorado')
        .replace('AR', 'Arkansas')
        .replace('TX', 'Texas')
        .replace('Select a state', np.nan)
        .replace('KS', 'Kansas')
        .replace('IL', 'Illinois')
        .replace('MD', 'Maryland')
        .replace('CA', 'California')
        .replace('M', np.nan)
        .replace('FL', 'Florida')
        .fillna('noState'),
    electric_company = lambda x: x.electric_company
        .fillna('noCompany')
)

#%%
leadInfoML['state'] = leadInfoML['state'].str.strip()
leadInfoML['state'] = leadInfoML['state'].str.lower()
leadInfoML['electric_company'] = leadInfoML['electric_company'].str.lower()
leadInfoML['electric_company'] = leadInfoML['electric_company'].str.strip()

#%%
leadInfoML['isCust'] = [1 if x == 18 else 0 for x in leadInfoML['status']]

#%%
combined = leadInfoML.merge(stc, left_on='lead_id', right_on='leadId')
combined = combined.query('state != "colorado" and state != "kansas" and state != "virginia"').drop(['leadId'], axis=1)
combined
```
This technique of using the `assign()` function with a `lambda x:` allowed me to access each column individually and work with it however I needed. I found the `.value_counts()` function quite useful when looking for items that did not belong in a certain column.
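For anyone who hasn't used it, here is a tiny sketch of that `value_counts()` trick on a made-up version of the messy state column: every distinct spelling gets its own row in the output, so the junk entries jump out immediately.

```python
import pandas as pd

# Toy version of the messy state column.
states = pd.Series(['Arkansas', 'Arkansas', 'AR', 'ok', 'Oklahoma',
                    'Tennesse', 'Select a State'])

# value_counts() lists every distinct value with its frequency,
# so oddballs like 'AR', 'ok', and 'Select a State' stand out.
print(states.value_counts())
```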
That is really all the cleaning I had to do for that set of data. I know it doesn’t seem like a lot, but I had to go through each column and figure out which points were correct, which points were slightly off, and which ones just needed to be replaced altogether.
I changed up my data as I was testing my machine learning models and found that the data I had been working with was not cutting it. So, I went and grabbed new data that focused heavily on the numbers I could pull from different tables in the database. (Check out We Need Data to see which tables I am talking about.)
These new data sets came with far fewer challenges and hurdles, because I did most of the cleaning in the SQL query itself. Since there was little to no user input, I didn’t have to deal with unusual or misplaced data. This also meant I had far fewer NAs to remove. The only thing I had to do was merge them properly. But OH MY GOODNESS GRACIOUS does `pandas` make this SO DIFFICULT. Here is the link to their documentation on merging/linking two data frames: Merge, join, concatenate and compare.
Is it just me, or is this super complicated to understand? It might just be me, but nevertheless it took me a long time to figure it out properly. The function I really needed to be looking at was `merge()`. Once I figured that out, I was able to successfully merge the different tables together. This is all I had to do:
```python
#%%
# Merge the new datasets so that they look better
leadsCombined = leadsAppt.merge(leadsCallLogs)
leadsData = leadsCombined.merge(leadsStatusHist)
leadsDat = leadsData.merge(stc, left_on='lead_id', right_on='leadId')
leadsFinal = leadsDat.merge(leadsStatus)
leadsFinal = leadsFinal.drop(['leadId'], axis=1)
leadsFinal
```
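The gotcha that tripped me up: `merge()` defaults to an inner join on whatever column names the two frames share, and only when the key columns are named differently (like `lead_id` vs `leadId`) do you spell them out with `left_on`/`right_on`. A toy example with made-up frames (not my real tables):

```python
import pandas as pd

leads = pd.DataFrame({'lead_id': [1, 2, 3], 'state': ['AR', 'OK', 'TN']})
calls = pd.DataFrame({'leadId': [1, 3], 'call_count': [4, 7]})

# Inner join on differently named keys: only leads 1 and 3
# appear in both frames, so lead 2 drops out.
combined = leads.merge(calls, left_on='lead_id', right_on='leadId')

# The join leaves both key columns behind; drop the duplicate.
combined = combined.drop(['leadId'], axis=1)
print(combined)  # two rows, for lead_id 1 and 3
```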
```python
#%%
########################
# Clean up the leadsData
# just a bit
########################
leadsFinal = leadsFinal.assign(
    state = lambda x: x.state
        .replace('VA', 'Virginia')
        .replace('TN', 'Tennessee')
        .replace('OK', 'Oklahoma')
        .replace('NJ', 'New Jersey')
        .replace('MO', 'Missouri')
        .replace('CO', 'Colorado')
        .replace('AR', 'Arkansas')
        .replace('TX', 'Texas')
        .replace('Select a state', np.nan)
        .replace('KS', 'Kansas')
        .replace('IL', 'Illinois')
        .replace('MD', 'Maryland')
        .replace('CA', 'California')
        .replace('M', np.nan)
        .replace('FL', 'Florida')
        .fillna('noState')
)

leadsFinal = leadsFinal.drop(['state'], axis=1)
leadsFinal['isCust'] = [1 if x == 18 else 0 for x in leadsFinal['status']]
```
I did end up dropping the column ‘state’ from this new combined data set because it was messing with my models. Other than that, the cleaning for this new data set was simple enough. With cleaned data, it is now time to run different models and get to the heart of this project: can I predict which leads will turn into customers?