Edit: This post is aimed at InfluxDB 1.x. Since InfluxDB 2 was released, a large part of this post has become obsolete. I might write a post for InfluxDB 2 in the future, but no guarantees.
In this blog post I’ll explain how I managed to get long-term data history for my Home Assistant setup with InfluxDB.
I’m writing this post because I’ve seen people on the internet struggle with this issue, and as I learned more about the subject I thought I might as well share that knowledge.
Before I get into specifics I’d like to share a bit about my HA setup so you know what is going on. I’m currently running HassOS on a Raspberry Pi 4. I use a Zigbee stick in combination with Zigbee2Mqtt to control about 8 lights and 2 Shelly plugs, and I tap into the P1 port of my ‘smart’ electricity meter.
I also run a server with various dockerized services. Since HA’s default SQLite database shortens the lifespan of the SD card it’s installed on, I’ve configured the recorder to store everything in the MariaDB instance on this server.
After some time running I’ve observed this database growing. HA is instructed to purge data after 32 days (I run a DB backup every month).
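That purge window is set in the recorder integration. A minimal sketch of what that looks like, assuming the same `mariadb_docker` secret used for the sensor below:

```yaml
recorder:
  # Store HA history in MariaDB instead of the default SQLite file
  db_url: !secret mariadb_docker
  # Purge recorded data older than 32 days
  purge_keep_days: 32
```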
With the following sensor I check the current database size:
(db_url is the same secret as defined in the recorder setting)
- platform: sql
  db_url: !secret mariadb_docker
  queries:
    - name: DB size
      query: 'SELECT table_schema "database", Round(Sum(data_length + index_length) / 1024 / 1024 / 1024, 1) "value" FROM information_schema.tables WHERE table_schema="homeassistant" GROUP BY table_schema;'
      column: 'value'
      unit_of_measurement: GB
With the current retention policy and the sensors mentioned earlier, the database had a size of 3.2GB. Although this is still manageable, it does not seem very efficient. A solution often mentioned is to exclude various entities or domains in the recorder config. That didn’t cut it for me, as I want to store this data for evaluation over a longer period of time and had already filtered out the ‘unwanted’ data.
Bonus tip: Check which entities are eating up your database space with this query:
SELECT entity_id, COUNT(entity_id) AS count
FROM states
GROUP BY entity_id
ORDER BY COUNT(entity_id) DESC;
You can also optimize your tables (a bit) with the following queries:
OPTIMIZE TABLE states;
OPTIMIZE TABLE events;
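The same counting trick works for the events table. A sketch of the equivalent query, assuming the default HA schema where each row has an `event_type` column:

```sql
-- Count rows per event type to see which events dominate the table
SELECT event_type, COUNT(event_type) AS count
FROM events
GROUP BY event_type
ORDER BY count DESC;
```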
Most sensors record and store their values multiple times per minute (or even per second). These datapoints are useful for short-term graphs, but it doesn’t really make sense to keep them long term. It would be awesome if we could downsample this data to, for instance, a mean value per minute. For sensors logging every 10 seconds, this would reduce the stored data sixfold. Over an even longer term, hourly mean values can be sufficient (depending on what you’re trying to achieve). There is probably some way to do this with MariaDB, but it isn’t designed for it. InfluxDB is!
What is InfluxDB? According to the official documentation: “InfluxDB is a time series database designed to handle high write and query loads. It is an integral component of the TICK stack. InfluxDB is meant to be used as a backing store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics.”
This sounds like it fits our needs :).
InfluxDB offers a feature called retention policies. I highly recommend reading the documentation on their website. It comes down to policies that remove/purge data after a set amount of time. In my case I want to achieve the following retention for all of my sensors:
| Retention policy | Contents |
| --- | --- |
| month (default) | Raw measurements, kept for 31 days |
| year | 1-minute mean value of measurements |
| infinite | 1-hour mean value of measurements |
I will probably edit this retention policy at some time, depending on how the database grows compared to the resources available.
To populate a retention policy, a continuous query is needed. These queries select values and write them into the new retention policy. More on that below.
Before we continue I expect you to have a running Home assistant instance as well as a running InfluxDB and Chronograf instance (or a complete TICK stack). These can be easily fired up using docker and I will not go into specifics of configuring these.
Sending data to InfluxDB
We first need to set up HA so it will send data to the InfluxDB instance. Add the following to your configuration.yaml. Full configuration options can be found in the HA documentation.
influxdb:
  host: !secret influxdb_host
  username: !secret influxdb_user
  password: !secret influxdb_password
I strongly advise forwarding only the entities you want to keep long term. It is possible to forward all entities, but that can make your InfluxDB cluttered.
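Such a filter can be sketched with the `include` option of the influxdb integration; the entity ids here are just examples from my own setup:

```yaml
influxdb:
  host: !secret influxdb_host
  username: !secret influxdb_user
  password: !secret influxdb_password
  # Only forward the entities listed here; everything else is ignored
  include:
    entities:
      - sensor.server_stack_power
      - sensor.pc_stack_power
```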
After you restart HA, check your Chronograf instance for incoming data. Home Assistant should automatically create a database called ‘home_assistant’. If not, you can create it manually with the following query:
CREATE DATABASE home_assistant;
I recommend clicking some queries together in the explore tab and get familiar with the interface and explore your data!
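For instance, these InfluxQL statements list the measurements HA created and the entities within one of them; ‘Watts’ is the measurement used throughout this post:

```sql
-- List all measurements in the database (HA creates one per unit of measurement)
SHOW MEASUREMENTS ON home_assistant
-- List the entities stored inside the "Watts" measurement
SHOW TAG VALUES ON home_assistant FROM "Watts" WITH KEY = "entity_id"
```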
Creating retention policies
In Chronograf, it is pretty easy to create retention policies. ‘autogen’ is the default retention policy, and it keeps data forever. We need to make the retention policy with the shortest time span the default, so that incoming metrics automatically land in it. The replication factor is not applicable for single-node setups and should therefore be left at 1. You can create retention policies by clicking the ‘Add Retention Policy’ button, or by executing the query mentioned below.
The GUI provides no option to change which policy is the default, so I executed this query instead:
ALTER RETENTION POLICY month on home_assistant DEFAULT
It is also possible to create the retention policy with a query, this would be:
CREATE RETENTION POLICY "month" ON "home_assistant" DURATION 31d REPLICATION 1 DEFAULT
Please note that only the initial retention policy, the one HA writes into, should be marked DEFAULT.
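The ‘year’ and ‘infinite’ retention policies can be created the same way. A sketch, where the exact durations (52 weeks, and infinite) are assumptions; pick what fits your setup:

```sql
-- Keeps the 1-minute means for roughly a year; no DEFAULT keyword here
CREATE RETENTION POLICY "year" ON "home_assistant" DURATION 52w REPLICATION 1
-- Keeps the 1-hour means forever
CREATE RETENTION POLICY "infinite" ON "home_assistant" DURATION INF REPLICATION 1
```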
Creating continuous queries
In order to move the data to the next retention policy, we need to create a continuous query. This is a query that runs automatically and periodically as new data points enter the database. It’s a ‘normal’ query, embedded in a CREATE CONTINUOUS QUERY statement:
CREATE CONTINUOUS QUERY <cq_name> ON <database_name>
BEGIN
  <cq_query>
END
At the time of writing, my InfluxDB contains these measurements:
These continuous queries have to be made per measurement and per retention policy. It is possible to downsample all measurements in a single query, but some advise against that as it might impact performance. So we’ll need to create two queries per measurement: one feeding the yearly retention policy and one feeding the infinite one. Since the monthly retention policy is the default, we don’t need to create a query for it.
For the sake of example in this blog I will only go through this process for the Watts measurement. This will give a good example of the possibilities.
My goal is to populate the yearly retention policy with mean values of every minute. Chronograf’s explore function is ideal for experimenting with query writing.
Make sure that you group by entity_id. Otherwise the metrics of all entities will be combined into a single series, which will cause some issues.
The goal is to create a query which shows the mean value of all metrics per minute. I came up with the following query:
SELECT mean(value) as value FROM "home_assistant"."autogen"."Watts" group by time(1m), entity_id FILL(linear)
Note that a FILL attribute was used; this makes sure that in case of an error or sensor fault, missing metrics are linearly interpolated. If preferred, it is also possible to store the max/min or any other aggregate function in a retention policy.
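For example, a variant of the same query that keeps the per-minute peak instead of the mean; just a sketch, using the same measurement and tags as above:

```sql
-- Per-minute maximum per entity instead of the mean
SELECT max(value) AS value FROM "home_assistant"."autogen"."Watts" GROUP BY time(1m), entity_id
```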
Comparing the amounts of data: each minute previously contained around 30 raw measurements, while after this query only 2 values per minute are stored (one mean per entity).
As shown, the graph becomes a bit less specific, but still does provide enough detail for analysis and other purposes, with about 6% of the database footprint that was used previously.
To make this query permanent, we have to store it into the new retention policy:
CREATE CONTINUOUS QUERY "cq_year_Watts" ON "home_assistant"
BEGIN
  SELECT mean(value) AS value
  INTO "home_assistant"."year"."Watts"
  FROM "home_assistant"."month"."Watts"
  GROUP BY time(1m), entity_id FILL(linear)
END
The explorer function in Chronograf offers query templates; take a look at them, they’ll help you with query writing. When the query has executed successfully, you can check it by executing:
SHOW CONTINUOUS QUERIES
And eventually we can query the data from this retention policy:
SELECT value AS "mean_value" FROM "home_assistant"."year"."Watts" GROUP BY "entity_id"
For the “infinite” retention policy, the continuous query is similar, downsampling further to hourly means:
CREATE CONTINUOUS QUERY "cq_infinite_Watts" ON "home_assistant"
BEGIN
  SELECT mean(value) AS value
  INTO "home_assistant"."infinite"."Watts"
  FROM "home_assistant"."month"."Watts"
  GROUP BY time(1h), entity_id FILL(linear)
END
And that’s it! You’ve taken the first step toward efficient data retention.
To visualize the data in InfluxDB you can either choose to use Grafana, or play in Chronograf. My preference is to develop some graphs in Chronograf, and use them in Grafana once I’m satisfied.
At this time I have created 3 graphs for my wattage measurement, one per retention policy (month/year/infinite).
The queries were clicked together using the Explore tab and are respectively:
SELECT mean("value") AS "mean_value" FROM "home_assistant"."month"."Watts" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("entity_id"='server_stack_power' OR "entity_id"='pc_stack_power') GROUP BY time(:interval:), "entity_id" FILL(previous)
SELECT mean("value") AS "mean_value" FROM "home_assistant"."year"."Watts" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("entity_id"='pc_stack_power' OR "entity_id"='server_stack_power') GROUP BY time(1m), "entity_id" FILL(previous)
SELECT mean("value") AS "mean_value" FROM "home_assistant"."infinite"."Watts" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND ("entity_id"='pc_stack_power' OR "entity_id"='server_stack_power') GROUP BY time(1h), "entity_id" FILL(previous)
I changed the GROUP BY interval to match the interval over which the mean is calculated in the corresponding continuous query. The FILL(previous) is meant to make the graph look nicer: values aggregated over 16:00–17:00 are stored at the 16:00 timestamp, so the most recent interval would otherwise show up as a gap at the end of your graph; FILL(previous) carries the last value forward.
Showing your InfluxDB data in HA
When you’ve created this long term data strategy, it is understandable that you want to query this data in HA.
There are multiple options to achieve this, of which the simplest is to fill a sensor entity with a queried value from InfluxDB. The downside is that this sensor will contain a single value and thus is not fit for showing a larger history graph. This method is described in the HA documentation.
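A sketch of such a sensor, using the influxdb sensor platform; the query name, entity id, and 7-day window are assumptions for illustration:

```yaml
sensor:
  - platform: influxdb
    host: !secret influxdb_host
    queries:
      # Mean server power over the last week, read from the 'year' retention policy
      - name: Server power last week
        unit_of_measurement: W
        group_function: mean
        where: '"entity_id" = ''server_stack_power'' AND time > now() - 7d'
        measurement: '"year"."Watts"'
        field: value
        database: home_assistant
```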
Another option mentioned in various Home Assistant forum topics is the Grafana image renderer plugin. I did not try this method, as it is not supported in the default Docker image of Grafana.
I’ve experimented with multiple options but decided to add a custom panel that links to my Grafana instance. This way I can keep my Lovelace clear from advanced statistics and still be able to access those from my home assistant instance when preferred. I might add single sensors for specific values when I feel the need to.
panel_iframe:
  grafana:
    title: 'Grafana'
    url: !secret grafana_dashboard_url
That’s about it for my first blog post! I hope you’ve learned something :). Let me know in the comment section if you have any questions, comments, or suggested improvements. I want to give a shout-out to dummylabs, as their ‘tutorial’/documentation inspired me to write this blog. Cheers!