The Housekeeping Vacuum Tool looks for any files and folders in the data locations of a Hive table that are referenced in neither the Hive metastore nor the Housekeeping database. Any paths discovered are considered "orphaned data" and are scheduled for removal via the Housekeeping process. The tool is typically used when previously orphaned path locations have been lost, which can happen if the Housekeeping database has become corrupt or a previous Housekeeping run did not finish gracefully (e.g. due to an unexpected failure or being killed prematurely). By running the Vacuum Tool against a Hive table you can identify data that is no longer referenced by Hive and is therefore a candidate for deletion.
The Vacuum Tool makes use of various Hadoop and Hive libraries and client executables, so these must be installed on the machine where the tool will be run. It does not require a Hadoop cluster; a Hadoop client "jump box" will suffice.
If you are planning to run the Vacuum Tool on EMR you may need to set up the EMR classpath by exporting the following environment variables before calling the `bin/vacuum.sh` script:

```bash
export HCAT_LIB=/usr/lib/hive-hcatalog/share/hcatalog/
export HIVE_LIB=/usr/lib/hive/lib/
```

Note that the paths above were correct when this document was last updated but may differ across EMR versions; refer to the EMR release guide for more up-to-date information if necessary.
Download the TGZ from Maven Central and then uncompress the file by executing:

```bash
tar -xzf housekeeping-vacuum-tool-<version>-bin.tgz
```
Although it's not strictly necessary, we recommend exporting the environment variable `HOUSEKEEPING_TOOL_HOME`, setting its value to wherever you extracted the archive:

```bash
export HOUSEKEEPING_TOOL_HOME=/<foo>/<bar>/housekeeping-vacuum-tool-<version>
```
Note: All updates to the table being vacuumed must be paused for the duration of the vacuum process, otherwise there is a risk that folders that have been newly created but not yet added to the metastore will be considered candidates for housekeeping.
Run the tool with your replication YAML configuration file:

```bash
$HOUSEKEEPING_TOOL_HOME/bin/vacuum.sh \
  --config=<your-config>.yml \
  [--dry-run=true] \
  [--partition-batch-size=1000]
```
The `--dry-run` option allows you to observe the status of paths on the file system, in the metastore, and in the Housekeeping database without actually scheduling anything for deletion. The `--partition-batch-size` option sets the number of partitions to retrieve from the table in each batch; lower it if an out-of-memory exception occurs.
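For example, a cautious first pass over a table might combine both options (the configuration file name is illustrative):

```bash
# Dry run: report orphaned paths without scheduling anything for deletion,
# fetching partitions in smaller batches to reduce memory pressure
$HOUSEKEEPING_TOOL_HOME/bin/vacuum.sh \
  --config=my-table-vacuum.yml \
  --dry-run=true \
  --partition-batch-size=500
```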
| Property | Required | Description |
|---|---|---|
| `catalog.name` | Yes | A name for the source catalog, used in events and logging. |
| `catalog.hive-metastore-uris` | Yes | Fully qualified URI of the source cluster's Hive metastore Thrift service. This property mimics the Hive property `hive.metastore.uris` and allows multiple comma-separated URIs. |
| `catalog.site-xml` | No | A list of Hadoop configuration XML files to add to the configuration for the source. |
| `catalog.configuration-properties` | No | A list of `key: value` pairs to add to the Hadoop configuration for the source. |
| `catalog.metastore-tunnel.route` | No | An SSH tunnel can be used to connect to source metastores. The tunnel may consist of one or more hops, which must be declared in this property. |
| `catalog.metastore-tunnel.private-keys` | No | A comma-separated list of paths to any SSH keys required in order to set up the SSH tunnel. |
| `catalog.metastore-tunnel.known-hosts` | No | Path to a known hosts file. |
| `catalog.metastore-tunnel.port` | No | The port on which SSH runs on the source master node. Default is `22`. |
| `catalog.metastore-tunnel.local-host` | No | The address on which to bind the local end of the tunnel. Default is `localhost`. |
| `tables.database-name` | Yes | The Hive database name for the table the vacuum tool will interrogate. |
| `tables.table-name` | Yes | The Hive table name for the table the vacuum tool will interrogate. |
| `housekeeping.schema-name` | No | The schema name used by your Housekeeping instance. Defaults to `housekeeping`. |
| `housekeeping.h2.database` | No | If `housekeeping.data-source.url` is not overridden, the default H2 database can be configured using this property, which also controls where H2 will write its database files. Defaults to `${instance.home}/data/${instance.name}/${housekeeping.schema-name}` (where `instance.home`, `instance.name` and `housekeeping.schema-name` can be configured separately for more fine-grained control). |
| `housekeeping.data-source.driver-class-name` | No | The fully qualified class name of your database driver. Defaults to the H2 driver if not configured. |
| `housekeeping.data-source.url` | No | JDBC URL for your database. Defaults to an H2 filesystem database if not specified. |
| `housekeeping.data-source.username` | No | Username for your database. |
| `housekeeping.data-source.password` | No | Password for your database. |
| `housekeeping.db-init-script` | No | A file containing a script to initialise your schema can be provided if the schema does not already exist. |
| `tables-validation.hive-table-properties` | No | A list of Hive table properties that must exist on every configured table. If any of these properties do not exist then the vacuum tool won't run. Set this to a custom property or an empty list if you vacuum tables that are not replicated by Circus Train. We always recommend running with `--dry-run=true` first and carefully reviewing the results before doing a "real" vacuum. Default is `com.hotels.bdp.circustrain.replication.event`. |
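For example, to vacuum a table that was never replicated by Circus Train, the default table-property check can be disabled with an empty list; the metastore URI and table names below are placeholders:

```yaml
catalog:
  name: vacuum_tool
  hive-metastore-uris: thrift://my-metastore-uri:9083
tables:
  - database-name: db
    table-name: table_1
tables-validation:
  # An empty list disables the default Circus Train replication-event check
  hive-table-properties: []
```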
In order to use an external JDBC-compliant database, the JDBC driver for that database must be made available on the CLASSPATH of the Vacuum Tool. This can be achieved in one of the following ways, as illustrated below:

- Adding the path to the driver jar file to the Housekeeping bootstrap CLASSPATH (e.g. `export HOUSEKEEPING_CLASSPATH=/path/to/mysql-connector-java-x.y.z.jar`).
- Placing the driver jar file in `$HOUSEKEEPING_TOOL_HOME/lib/`.
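Concretely, either approach might look like this (the driver path and version are placeholders):

```bash
# Option 1: expose the driver jar via the Housekeeping bootstrap CLASSPATH
export HOUSEKEEPING_CLASSPATH=/path/to/mysql-connector-java-x.y.z.jar

# Option 2: copy the driver jar into the tool's lib directory
cp /path/to/mysql-connector-java-x.y.z.jar "$HOUSEKEEPING_TOOL_HOME/lib/"
```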
The configuration then needs to be updated to something like the following:

```yaml
catalog:
  name: vacuum_tool
  hive-metastore-uris: thrift://my-metastore-uri:9083
tables:
  - database-name: db
    table-name: table_1
housekeeping:
  schema-name: my_db
  dataSource:
    driverClassName: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://db-host:3306/${housekeeping.schema-name}
    username: user
    password: foo
```
Note: To use MySQL and similar database systems, the schema specified in the configuration must already exist, as the value of `housekeeping.data-source.url` needs to be a valid URI.
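For instance, with the MySQL configuration above, the schema could be created up front; the host, user, and schema name below are the placeholders from that example:

```bash
# Create the schema referenced by housekeeping.data-source.url before running the tool
mysql -h db-host -u user -p -e "CREATE DATABASE IF NOT EXISTS my_db;"
```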
The Vacuum Tool already has the required H2 drivers on its CLASSPATH, so the only change required to use H2 is to create a configuration file similar to the one below:

```yaml
catalog:
  name: vacuum_tool
  hive-metastore-uris: thrift://my-metastore-uri:9083
tables:
  - database-name: db
    table-name: table_1
housekeeping:
  schema-name: my_db
  db-init-script: file:///tmp/schema.sql
  h2:
    # Location of H2 DB on filesystem
    database: /home/hadoop/vacuumtest/data/${housekeeping.schema-name}
  dataSource:
    username: user
    password: foo
```
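The tool can then be pointed at this file in the usual way (the configuration file name below is illustrative); H2 creates its database files at the configured `database` location on first use:

```bash
# Run against the H2-backed configuration; add --dry-run=true for a read-only first pass
$HOUSEKEEPING_TOOL_HOME/bin/vacuum.sh --config=vacuum-h2.yml
```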