We Need Data

Describing my source of data and my journey in acquiring said data.

Kaelan McGurk https://github.com/kaelanmcgurk
05-10-2021

Can’t do Much with Nothing

As my grandfather always said, “You can’t be a data scientist without data!” Well… he never said that but the sentiment is TRUE. For me to go anywhere with this project, I will need to get the correct data for each lead in Shine’s system.

Luckily for me, I spent my entire internship creating visualizations for the sales teams and executives to keep track of expenses, rep reports, and lead/customer data. After around four months of work in their relational database, I felt confident I knew which data I needed. The only problem was getting it OUT of said relational database. The entirety of my internship had me locked in either MySQL Workbench or a Power BI knockoff called Zoho Analytics. Zoho is a decent business intelligence tool: it is rather user friendly, and I can write custom SQL queries if I need to do some light data wrangling. The problem is that neither of those interfaces can create machine learning models. I had no other choice.

I had to bust out Python.

[Image: the Python logo, captioned "Hey, that's me!"]

Yes, very good, Python.

Link to the database

Now, unfortunately for me, getting the database hooked up to my local machine was no easy task. If you would like a more in-depth explanation of my two-and-a-half-month-long battle with Python, I can tell it, but long story short: the Google-hosted database did not play well with VS Code. I had to download a dump of the database onto my computer and link it up to Python with the mysql-connector package. This wasn't the live database, but it is enough data to begin my project. Here is the code I used to import the database:

import mysql.connector
import pandas as pd

def callLuminary():
    # Connect to the local copy of the database
    LuminaryDB = mysql.connector.connect(
        host="localhost",
        port="3306",
        user="THIS IS NOT THE USERNAME",
        password="THIS IS NOT THE PASSWORD",
        database="NAME OF THE DATABASE"
    )
    return LuminaryDB

def makeQueryDataFrame(dataBase, sql):
    # Run a SQL query against an open connection and return the results as a DataFrame
    mycursor = dataBase.cursor()
    mycursor.execute(sql)
    myresult = mycursor.fetchall()
    return pd.DataFrame(myresult, columns=mycursor.column_names)

With these functions, all I have to do is provide the database connection and the SQL query I need to run. Simple, clean, and ready for use.
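
To make that concrete, here is a minimal usage sketch. The query itself is made up purely for illustration; any valid SELECT against the database works the same way.

# A tiny made-up query, just to show the workflow
luminary = callLuminary()
sample = makeQueryDataFrame(luminary, "SELECT lead_id, state FROM leads LIMIT 5;")
print(sample.head())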

Along with the local database, I can always directly query the live database from the web app Shine uses and load in the downloaded .csv file. Either way works well for my purposes.

What tables to use

The data that comes from leads is a little scattered around the database. However, I am awesome at my job (and have helpful coworkers), so I was able to find the best tables to pull from. The first table I am pulling is 'leads'. Very unique name, I know, but very helpful when searching a database of 130+ tables. This table covers a wide range of data that the lead inputs into our system: name, state, zip code, the electric company they use, the date they were created in the system, etc. Perfect! Such great data all in one place! This should be pretty simple. The table also has a bunch of columns that are useless to me, so I will simply not select those.

leads = '''
SELECT lead_source,
    owner,
    YEAR(date_created) AS yearCreated,
    MONTH(date_created) AS monthCreated,
    DAY(date_created) AS dayCreated,
    DAYNAME(date_created) AS dayName,
    city,
    state,
    zip_code,
    electric_company,
    square_footage,
    status AS y
FROM leads
WHERE owner != 6585993;
'''
luminary = callLuminary()
leadsData = makeQueryDataFrame(luminary, leads) # Remember? From the custom function?

I have data on the date a lead was input into the system, their general location, the electric company they use, and, of course, their status in the system, which says whether or not they are a customer.
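
Before touching any models, a few quick sanity checks are worth running. This is just a sketch, using the column names from the query above:

print(leadsData.shape)                # how many leads and columns came back
print(leadsData['y'].value_counts())  # distribution of lead statuses, our target
print(leadsData.isna().sum())         # missing values per column, a preview of the cleaning ahead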

I ran and played with this data a bunch. I cleaned it (see Wrangle the Messy Data) and ran it through my different machine learning models (see Branches of Machine Learning). Working through some of my machine learning methods, I realized I had to get different data (again, see Branches of Machine Learning).

I decided that pulling data from just one table was quite lacking. I needed more diversity, so I looked into the tables 'call_logs' and 'status_history' as well. Call_logs holds all the call information that passes between leads and sales reps: columns like 'to_number' and 'from_number', timestamps for the start and end of each call, the sales rep who called, etc. Status_history holds each status a lead has gone through. There are many statuses in our system; in particular, I am looking for the ones a lead passes through before they hit Customer. Those include New Lead, Appointment Scheduled, Do Not Contact, Working, and Attempted Contact, among many others.

From the new tables I can use these wonderful queries:

statusHistory = '''
SELECT
         l.lead_id AS lead_id,
         COUNT(sh.status) AS numberOfStatuses
FROM leads l
JOIN status_history sh ON l.lead_id = sh.lead
WHERE sh.agent  != 6585993 AND l.date_created > '2021/03/05' AND l.status != 18
GROUP BY  l.lead_id;
'''

callLogs = '''
SELECT
         l.lead_id AS lead_id,
         COUNT(cl.call_id) AS numberOfCalls,
         ROUND(SUM(TIMESTAMPDIFF(SECOND, cl.call_started, cl.call_completed)) / 60, 2) AS callTimeMinute
FROM  leads l
JOIN call_logs cl ON l.lead_id = cl.lead
WHERE cl."user"  != 6585993 AND l.date_created > '2021/03/05' AND l.status != 18
GROUP BY  l.lead_id;
'''

stc = '''
SELECT
    l.lead_id AS leadId,
    l.date_created as dateCreated,
    cl.call_started AS callStarted,
    TIMESTAMPDIFF(HOUR, l.date_created, cl.call_started) AS hoursSinceCall
FROM leads l
JOIN call_logs cl ON l.lead_id = cl.lead
JOIN lead_sources ls ON ls.source_id = l.lead_source
WHERE cl.user  != 6585993 AND 
    l.date_created > '2021/03/05' AND 
    l.status != 18 AND 
    cl.call_started > l.date_created AND 
    LOCATE("Scheduled", ls.source_name) = 0;
'''
    
# For now, load the same three result sets from .csv exports of the live database
stc = pd.read_csv('testingData/stcNew.csv')
callLogs = pd.read_csv('testingData/callLogsNew.csv')
statusHistory = pd.read_csv('testingData/statusHistoryNew.csv')

These new queries return only numeric columns. I did this because I ran into some trouble when working with all the categorical columns in my machine learning models. (Just… just go check out Branches of Machine Learning. I promise it will make more sense then.) I wanted to look at how many calls sales reps make to leads, how much total call time those calls add up to, how many statuses leads go through on the way to becoming customers, and how quickly sales reps first call a lead after they initially enter the system.
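
To give a feel for where this is headed, here is a sketch of stitching the three result sets into a single per-lead table. It assumes the .csv exports keep the column aliases from the queries above (lead_id in the first two pulls, leadId in the stc pull):

# Reduce stc to one row per lead: the hours until the FIRST call after creation
firstCall = (stc.groupby('leadId', as_index=False)['hoursSinceCall']
                .min()
                .rename(columns={'hoursSinceCall': 'hoursToFirstCall'}))

# Join the per-lead call counts, call minutes, status counts, and first-call delay
features = (callLogs
            .merge(statusHistory, on='lead_id', how='inner')
            .merge(firstCall, left_on='lead_id', right_on='leadId', how='left')
            .drop(columns='leadId'))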

We now have data

And those are the columns I need! I found the data in the database, grabbed the columns I needed, ran through the models I made, and pulled more data when I found it necessary.

Now that I have this data, it needs to be cleaned because

WHOO BOY

Sales reps do not have a common practice of inputting data.