Part 1. Data Set download, cleaning and preparation¶
The set is taken from here:
https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption#
The file has extention .txt and it contains a semicolon delimited table with a header.
I start with downloading Python modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import date
from datetime import timedelta
Data download and initial check¶
Let us download the data, look at first rows and check dimensions.
power_consumption = pd.read_csv('household_power_consumption.txt', sep=';', low_memory=False)
power_consumption.head(), power_consumption.shape
As we see the table has 2 million rows, and the date column has European standard: a day number before a month number.
What are our columns types?
power_consumption.dtypes
What is a range for the numbers of records for a day?
number_of_entries_per_day = dict()
for day in power_consumption.Date:
if day in number_of_entries_per_day:
number_of_entries_per_day[day] +=1
else:
number_of_entries_per_day[day] = 1
frequences = number_of_entries_per_day.values()
min(frequences), max(frequences)
plt.hist(frequences, bins=10)
plt.show()
As we see a vast majority of days have about the same amount of records, which is good for analysis.
How many observations we have in an hour during 24-hour period?
number_of_entries_per_day_and_hour = dict()
for day, entry in zip(power_consumption.Date, power_consumption.Time):
hour = entry[:2]
if (day,hour) in number_of_entries_per_day_and_hour:
number_of_entries_per_day_and_hour[(day,hour)] +=1
else:
number_of_entries_per_day_and_hour[(day,hour)] = 1
frequences = number_of_entries_per_day_and_hour.values()
min(frequences), max(frequences)
plt.hist(frequences, bins=10)
plt.show()
The next question is: do we have any missing days?
We can figure out that years for our data are 2006, 2007, 2008, 2009, 2010, because the first year is 2006 and we have 47 month of data. Recall that a year in our Date
column takes last 4 places.
days_in_a_year = {2006: (0), 2007:(0), 2008: (0), 2009:(0), 2010:(0)}
for day in number_of_entries_per_day:
year = int(day[-4:])
days_in_a_year[year] += 1
days_in_a_year
Clearly we are fine with years 2007, 2008, 2009. We are good even with 2006 if we think about it: the only days for it are in December, and here are 31 days in the month. We start our records from 16th, so 15 days in December were without records, and 31-15 = 16. The year 2010 is more complicated. But we can get enumeration of year days in our data for the year, and check if the list contains all intermediate numbers.
year2010days = [pd.to_datetime(date).strftime('%j') for date in number_of_entries_per_day.keys() if date[-4:] == '2010']
min(year2010days), max(year2010days)
We obviously have some missing days. Let us see which ones.
missing_dates = list(range(1, 346))
for i in year2010days:
missing_dates.remove(int(i))
print("A number of missing days in 2010 is {}. Here they are:".format(len(missing_dates)))
missing_dates.sort()
first_day_2010 = date(2010, 1, 1)
for i in missing_dates:
print((first_day_2010 + timedelta(days=i-1)).strftime("%A %d, %B %Y"))
print("The last day of recording is {}.".format((first_day_2010 + timedelta(days=344)).strftime("%A %d, %B %Y")))
Something is strange with 12th day of a month. And a few days in the end of the November are missing.
What is the proportion with respect to the total number of days?
15/(16+365+366+365+330)
So we have 1% more of missing data, which is above declared 1.25% in the data description. Still it is not very much, unless we decide to investigate what happens each month of year 2010.
Conversion into Date and Time types.¶
It looks like our Date and Time columns do not have missing values, or the previous cells would fail. I will transform them into Python date and time types, correspondingly. In addition I can create a timestamp column.
power_consumption["Date"] = pd.to_datetime(power_consumption["Date"])
power_consumption["Time"] = pd.to_timedelta(power_consumption["Time"])
power_consumption["Timestamp"] = power_consumption["Date"] + power_consumption["Time"]
Missing values as NaN¶
I will calculate missing values for each column.
power_consumption.isnull().sum()
We see missing values in the last column.
Discovery of masked missing values and their consequent replacement with NaNs.¶
Now I need to change data types so I can use them as numeric values. I will start with numerical values.
cols = list(power_consumption.columns)
cols.remove('Date'), cols.remove('Time'), cols.remove('Timestamp'), cols.remove('Sub_metering_3')
for col in cols:
power_consumption[col] = pd.to_numeric(power_consumption[col])
I see a non-numerical value in the table, "?". It appears at row 6839. Let us look at the row.
power_consumption.iloc[6839, :]
Alas, here are more missing values. They are masked as '?'. I would rather have NaN
instead.
power_consumption = power_consumption.replace('?', np.NaN)
power_consumption.isnull().sum(), power_consumption.iloc[6839, :]
Let us see what is a proportion of rows with missing values.
25979/2075259
So these are 1.25% of missing values about which we read on the data set home page.
Conversion to numerical values.¶
At this point the rest of column types may be switched to numerical.
for col in cols:
power_consumption[col] = pd.to_numeric(power_consumption[col])
power_consumption.dtypes
Here is a final check of data:
power_consumption.dtypes
power_consumption.head()
Here is my github repository with the noteboook: https://github.com/Mathemilda/Individual-household-electric-power-consumption-Data-Set