i recently encountered an issue where some projects that were previously listed in the registries are no longer available. this breaks our offsetsDB data ingestion pipeline due to the following constraint:
every project in the credit and clip tables must have a corresponding project record in the project table.
whenever a project is listed and we pick it up in weekly summary clips or curated clips, and then it disappears from the registry, we end up with a broken ingestion pipeline. for example, back in May, ACR988 was added to the database. however, this project no longer shows up in the data we download from ACR.
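a minimal sketch of the failure mode, assuming a stripped-down two-table schema and an in-memory SQLite database as a stand-in for our real database. the table layout and the `insert_credit` helper are hypothetical; only the idea (a credit row needs a matching project row) comes from the constraint above.

```python
import sqlite3

# Hypothetical two-table schema; the real tables have many more columns.
conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite only enforces foreign keys when this is on
conn.execute('CREATE TABLE project (project_id TEXT PRIMARY KEY)')
conn.execute(
    'CREATE TABLE credit ('
    ' id INTEGER PRIMARY KEY,'
    ' project_id TEXT NOT NULL REFERENCES project (project_id),'
    ' quantity INTEGER)'
)


def insert_credit(project_id: str, quantity: int) -> bool:
    """Try to insert a credit row; return False if the foreign key check rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                'INSERT INTO credit (project_id, quantity) VALUES (?, ?)',
                (project_id, quantity),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# ACR988 is missing from the registry download, so no project row exists for it.
rejected = insert_credit('ACR988', 100)

# Adding a placeholder project row first lets the same insert go through.
with conn:
    conn.execute('INSERT INTO project (project_id) VALUES (?)', ('ACR988',))
accepted = insert_credit('ACR988', 100)
```

here `rejected` is `False` and `accepted` is `True`: the same credit row fails before the placeholder exists and succeeds after.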
to address this edge case, i made some changes to ensure that placeholder projects are created (using data that we can easily derive from a project_id, e.g., registry, details_url) for missing project IDs.
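to illustrate what "easily derive from a project_id" can look like, here is a small sketch. the prefix-to-registry mapping and the `placeholder_fields` helper are assumptions for illustration (the actual lookup is done by `get_registry_from_project_id`, whose implementation is not shown here); the base URLs are the ones used in the change itself.

```python
# Assumed mapping from the three-letter ID prefix to a registry name.
# This is illustrative only; the real lookup may use different prefixes.
PREFIX_TO_REGISTRY = {
    'VCS': 'verra',
    'GLD': 'gold-standard',
    'ACR': 'american-carbon-registry',
    'CAR': 'climate-action-reserve',
    'ART': 'art-trees',
}

# Base URLs for project detail pages, keyed by registry name.
REGISTRY_URLS = {
    'verra': 'https://registry.verra.org/app/projectDetail/VCS/',
    'gold-standard': 'https://registry.goldstandard.org/projects?q=gs',
    'american-carbon-registry': 'https://acr2.apx.com/mymodule/reg/prjView.asp?id1=',
    'climate-action-reserve': 'https://thereserve2.apx.com/mymodule/reg/prjView.asp?id1=',
    'art-trees': 'https://art.apx.com/mymodule/reg/prjView.asp?id1=',
}


def placeholder_fields(project_id: str) -> dict:
    """Derive the fields of a placeholder project from the project ID alone."""
    registry = PREFIX_TO_REGISTRY.get(project_id[:3].upper())
    base_url = REGISTRY_URLS.get(registry)
    return {
        'project_id': project_id,
        'registry': registry,
        # The numeric part of the ID follows the three-letter prefix.
        'project_url': f'{base_url}{project_id[3:]}' if base_url else None,
    }


fields = placeholder_fields('ACR988')
```

an ID with an unrecognized prefix yields `None` for both `registry` and `project_url`, so the caller can decide how to handle it.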
in #124, i made the following changes to handle this edge case:
import pandas as pd
from sqlalchemy import text
from sqlalchemy.exc import IntegrityError
from sqlmodel import Session, col, select

# Project, get_registry_from_project_id, get_session, and logger are defined elsewhere in the codebase.


def ensure_projects_exist(df: pd.DataFrame, session: Session) -> None:
    """
    Ensure all project IDs in the dataframe exist in the database.
    If not, create placeholder projects for missing IDs.
    """
    logger.info('🔍 Checking for missing project IDs')

    # Get all unique project IDs from the dataframe
    credit_project_ids = df['project_id'].unique()

    # Query existing project IDs
    existing_project_ids = set(
        session.exec(
            select(Project.project_id).where(col(Project.project_id).in_(credit_project_ids))
        ).all()
    )

    # Identify missing project IDs
    missing_project_ids = set(credit_project_ids) - existing_project_ids

    logger.info(f'🔍 Found {len(existing_project_ids)} existing project IDs')
    logger.info(f'🔍 Found {len(missing_project_ids)} missing project IDs: {missing_project_ids}')

    # Create placeholder projects for missing IDs
    urls = {
        'verra': 'https://registry.verra.org/app/projectDetail/VCS/',
        'gold-standard': 'https://registry.goldstandard.org/projects?q=gs',
        'american-carbon-registry': 'https://acr2.apx.com/mymodule/reg/prjView.asp?id1=',
        'climate-action-reserve': 'https://thereserve2.apx.com/mymodule/reg/prjView.asp?id1=',
        'art-trees': 'https://art.apx.com/mymodule/reg/prjView.asp?id1=',
    }
    for project_id in missing_project_ids:
        registry = get_registry_from_project_id(project_id)
        # Build the details URL from the registry's base URL and the numeric part of the ID
        if url := urls.get(registry):
            url = f'{url}{project_id[3:]}'
        placeholder_project = Project(
            project_id=project_id,
            registry=registry,
            category=['unknown'],
            protocol=['unknown'],
            project_url=url,
        )
        session.add(placeholder_project)

    try:
        session.commit()
        logger.info(f'✅ Added {len(missing_project_ids)} missing project IDs to the database')
    except IntegrityError as exc:
        session.rollback()
        logger.error(f'❌ Error creating placeholder projects: {exc}')
        raise


def process_dataframe(df, table_name, engine, dtype_dict=None):
    logger.info(f'📝 Writing DataFrame to {table_name}')
    logger.info(f'engine: {engine}')

    with engine.begin() as conn:
        if engine.dialect.has_table(conn, table_name):
            # Instead of dropping table (which results in data type, schema overrides), delete all rows.
            conn.execute(text(f'TRUNCATE TABLE {table_name} RESTART IDENTITY CASCADE;'))

        if table_name in {'credit', 'clipproject'}:
            session = next(get_session())
            try:
                logger.info(f'Processing data destined for {table_name} table...')
                ensure_projects_exist(df, session)
            except IntegrityError:
                logger.error('❌ Failed to ensure projects exist. Continuing with data insertion.')

        # write the data
        df.to_sql(table_name, conn, if_exists='append', index=False, dtype=dtype_dict)
this change ensures that our database remains consistent even when projects are unlisted from the registries