Part 2. Data Aggregation and Graphs.¶
This work continues from my previous post:
Individual household electric power consumption Data: Download and cleaning
The data are taken form here:
https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption#
In the previous post I learned that some of data is missing for a whole days. It means that doing daily averages for each year is not helpful, unless the missing days are skipped. Fortunately Pandas does it by default during aggregation, but we need to keep in mind what averages are meaningful and what are not.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
In the previous post the data was downloaded, a problem with missing values was found and fixed, correct types assigned to columns. All the data preparation is repeated in the following cell.
power_consumption = pd.read_csv('household_power_consumption.txt', sep=';', low_memory=False)
power_consumption["Date"] = pd.to_datetime(power_consumption["Date"])
power_consumption["Time"] = pd.to_timedelta(power_consumption["Time"])
power_consumption["Timestamp"] = power_consumption["Date"] + power_consumption["Time"]
power_consumption = power_consumption.replace('?', np.NaN)
cols = list(power_consumption.columns)
cols.remove('Date'), cols.remove('Time'), cols.remove('Timestamp'), cols.remove('Sub_metering_3')
for col in cols:
power_consumption[col] = pd.to_numeric(power_consumption[col])
power_consumption.shape
(2075259, 10)
Here is a reminder with what we work.
power_consumption.head()
Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | Timestamp | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2006-12-16 | 0 days 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 | 2006-12-16 17:24:00 |
1 | 2006-12-16 | 0 days 17:25:00 | 5.360 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 | 2006-12-16 17:25:00 |
2 | 2006-12-16 | 0 days 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 | 2006-12-16 17:26:00 |
3 | 2006-12-16 | 0 days 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 | 2006-12-16 17:27:00 |
4 | 2006-12-16 | 0 days 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 | 2006-12-16 17:28:00 |
By description Global_active_power
has different units (in kilowatt), than other columns which measure similar values, which are sub-meterings 1, 2 and 3. We see in data description a formula:
(global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3)
It represents the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3. As we see the sub-meterings measurements are done for appliances consuming a lot of power.
I will add another column which represents this value. At first I convert Global_active_power
to the same units as others:
power_consumption["global_active_power_in_WH"] = power_consumption.Global_active_power*1000/60
And I want to introduce a separate value for Lighting and Small Appliances.
power_consumption["lighting_small_appliances"] = (power_consumption.Global_active_power*1000/60 -
power_consumption.Sub_metering_1 -
power_consumption.Sub_metering_2 -
power_consumption.Sub_metering_3)
Hourly averages¶
Plotting our raw data is difficult because we have way too much. So at first I computed hourly averages for each day of year. Note that the mean
method ignores NaN
values.
power_consumption['Hour'] = power_consumption['Time'].dt.components['hours']
power_consumption_averaged_byHour = power_consumption.groupby(['Hour',]).mean()
power_consumption_averaged_byHour[["global_active_power_in_WH", "lighting_small_appliances",
"Sub_metering_1", "Sub_metering_2", "Sub_metering_3"] ].head()
global_active_power_in_WH | lighting_small_appliances | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | |
---|---|---|---|---|---|
Hour | |||||
0 | 10.990570 | 7.104149 | 0.364378 | 0.555392 | 2.966651 |
1 | 8.988753 | 5.989900 | 0.248958 | 0.397963 | 2.351931 |
2 | 8.010343 | 5.487186 | 0.145858 | 0.350133 | 2.027166 |
3 | 7.414439 | 5.304533 | 0.067765 | 0.350061 | 1.692080 |
4 | 7.397454 | 5.149989 | 0.047142 | 0.334048 | 1.866275 |
Below we can see consumed energy for different categories.
power_consumption_averaged_byHour = power_consumption.groupby(['Hour',]).mean()
fig, axs = plt.subplots(5, figsize=(12, 5.5), sharex='col')
axs[0].plot(power_consumption_averaged_byHour.global_active_power_in_WH.values,
color='purple')
axs[0].text(0, 26, 'Global_active_power')
axs[1].plot(power_consumption_averaged_byHour.lighting_small_appliances.values,
color='purple')
axs[1].text(0, 16, 'Lighting and small appliances')
axs[2].plot(power_consumption_averaged_byHour.Sub_metering_1.values,
color='teal')
axs[2].text(0, 2.2, 'A dishwasher, an oven and a microwave')
axs[3].plot(power_consumption_averaged_byHour.Sub_metering_2.values,
color='brown')
axs[3].text(0, 2, 'A washing-machine, a drier, a refrigerator, a light')
axs[4].plot(power_consumption_averaged_byHour.Sub_metering_3.values,
color='red')
axs[4].text(0, 10, 'A water-heater, an air-conditioner')
plt.xticks(ticks = list(range(0,24)), labels = list(range(0, 24)))
plt.show()
We learned that the device consuming the most electricity is a water heater. And clearly people sleep at night.
In electric power industry knowledge about ranges for voltage and intencity is crucial, as well as what kind of common fluctuations may be here. In particular, how much can be an intensity surge and how many are such surges? We can check it out with histograms for our data. And power system equipments are designed to operate within ±5% of the nominal voltages.
Remembering about Nan
values helps to avoid error messages.
Hourly consumption¶
Note that to see how much energy was consumed in an hour we heed to sum it up.
power_consumption_averaged_byHour = power_consumption.groupby(['Hour',]).sum()
complete_rows = ~np.isnan(power_consumption.Global_intensity)
fig, axs = plt.subplots(1,2, figsize=(14, 4))
axs[0].hist(power_consumption.Global_intensity[complete_rows].values, bins=100, color="C1")
axs[0].text(25, 290000, 'Global intensity')
axs[1].hist(power_consumption.Voltage[complete_rows].values, bins=100, color="C4")
axs[1].text(225, 62000, 'Voltage')
plt.show()
As we see the 'Global intensity' magnitude may increase above 40 ampere, although rarely. With our scale we cannot see how many times this has happened. In addition the distribution of 'Global intensity' is not only non-normal, but it is skewed and bimodal. So statistical methods should be used with caution. The standard voltage in France is 230 volt, and our values are way off. Which means that appliances must be engineered to withstand such fluctuations.
We need to check how often Global Intensity is too high because our security measures depend on it.
Let us see how often 'Global intensity' went over 45 amperes.
((power_consumption[complete_rows])[(power_consumption.Global_intensity[complete_rows].values > 40)]).Date.dt.date.unique()
array([datetime.date(2007, 1, 13), datetime.date(2007, 2, 22), datetime.date(2007, 4, 3), datetime.date(2007, 5, 3), datetime.date(2007, 11, 11), datetime.date(2007, 12, 23), datetime.date(2007, 12, 28), datetime.date(2008, 1, 26), datetime.date(2008, 1, 27), datetime.date(2008, 12, 4), datetime.date(2008, 10, 19), datetime.date(2008, 11, 30), datetime.date(2009, 1, 25), datetime.date(2009, 2, 22), datetime.date(2009, 11, 24), datetime.date(2010, 11, 20)], dtype=object)
It looks like only 8 days in 3 years. But each time such surge may cause a power shutdown. Although there seems to be none for 4 years, because "Global_intensity" never dropped too low:
((power_consumption[complete_rows])[(power_consumption.Global_intensity[complete_rows].values < .2)]).Date.dt.date.unique()
array([], dtype=object)
Or maybe power shutdowns have happened during missing data days.
Weekly usage¶
Let us check how the power usage differs during a week. For this I will split my data between weekday usage and weekend usage. At first I extract a weekday number as a new column. In Python days of week may be presented as a decimal number, where 0 is Sunday and 6 is Saturday.
power_consumption['DayOfWeek'] = pd.to_numeric(power_consumption['Date'].apply(lambda x: x.strftime('%w')))
Now I can separate the data for weekend and weekday usage and compute averages for each hour as before.
power_consumption_weekend = power_consumption[power_consumption.DayOfWeek % 6==0]
power_consumption_weekday = power_consumption[power_consumption.DayOfWeek % 6!=0]
power_consumption_means_weekend = power_consumption_weekend.groupby(['Hour' ]).mean()
power_consumption_means_weekday = power_consumption_weekday.groupby(['Hour']).mean()
Here I plot and check what are power usage dynamics during a Saturday and Sunday. Black color is for weekday values and red is for weekend ones.
fig, axs = plt.subplots(2,2, figsize=(14, 6), sharex='col')
axs[0,0].plot(power_consumption_means_weekday.lighting_small_appliances.values, color='black')
axs[0,0].plot(power_consumption_means_weekend.lighting_small_appliances, color='red')
axs[0,0].text(0, 18, 'Lighting and small appliances')
axs[0,1].plot(power_consumption_means_weekday.Sub_metering_3.values, color='black', label ="Weekday")
axs[0,1].plot(power_consumption_means_weekend.Sub_metering_3.values, color='red', label ="Weekend")
axs[0,1].text(0, 12, 'A water-heater,\n an air-conditioner')
axs[1,0].plot(power_consumption_means_weekday.Sub_metering_2.values, color='black')
axs[1,0].plot(power_consumption_means_weekend.Sub_metering_2.values, color='red')
axs[1,0].text(0, 2.5, 'A washing-machine, a drier, \n a refrigerator, a light')
axs[1,1].plot(power_consumption_means_weekday.Sub_metering_1.values, color='black')
axs[1,1].plot(power_consumption_means_weekend.Sub_metering_1.values, color='red')
axs[1,1].text(0, 2.5, 'A dishwasher, an oven and \n a microwave')
#for i in range(2):
# axs[0,i].set_xticks(list(range(0,1440, 60)))
# axs[0,i].set_xticklabels(list(range(0, 24)))
fig.legend(loc='center right')
plt.show()
Electrical engineers can use in their work such helpful data interference as relationship between voltage and power intensity.
plt.scatter(power_consumption.Voltage[complete_rows].values,
power_consumption.Global_active_power[complete_rows].values, s=.5)
plt.xlabel("Voltage")
plt.ylabel("Global active power")
plt.show()