Skip to main content

Ingestion failures

This runbook documents failures we have encountered with the ingestion process.

Recursion errors

If the Datahub ingestion process exceeds python’s recursion limit, it may fail early, with an error like this:

2024-07-03T11:39:41.2461669Z [2024-07-03 11:39:41,181] ERROR    {datahub.ingestion.run.pipeline:491} - Caught error
2024-07-03T11:39:41.2463027Z Traceback (most recent call last):
2024-07-03T11:39:41.2465452Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/datahub/sql_parsing/sqlglot_utils.py", line 239, in try_format_query
2024-07-03T11:39:41.2467355Z     return expression.sql(dialect=dialect, pretty=True)
2024-07-03T11:39:41.2468362Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2470085Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/expressions.py", line 601, in sql
2024-07-03T11:39:41.2471674Z     return Dialect.get_or_raise(dialect).generate(self, **opts)
2024-07-03T11:39:41.2472492Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2475987Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/dialects/dialect.py", line 520, in generate
2024-07-03T11:39:41.2477701Z     return self.generator(**opts).generate(expression, copy=copy)
2024-07-03T11:39:41.2478569Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2480275Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 620, in generate
2024-07-03T11:39:41.2481709Z     sql = self.sql(expression).strip()
2024-07-03T11:39:41.2482328Z           ^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2483848Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 774, in sql
2024-07-03T11:39:41.2485246Z     sql = transform(self, expression)
2024-07-03T11:39:41.2485862Z           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2487436Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/transforms.py", line 647, in _to_sql
2024-07-03T11:39:41.2488861Z     return _sql_handler(expression)
2024-07-03T11:39:41.2489452Z            ^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2491034Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 2324, in select_sql
2024-07-03T11:39:41.2492855Z     sql = self.prepend_ctes(expression, sql)
2024-07-03T11:39:41.2493537Z           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2495176Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 1083, in prepend_ctes
2024-07-03T11:39:41.2496769Z     with_ = self.sql(expression, "with")
2024-07-03T11:39:41.2497406Z             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2498967Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 768, in sql
2024-07-03T11:39:41.2500482Z     return self.sql(value)
2024-07-03T11:39:41.2501000Z            ^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2502568Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 774, in sql
2024-07-03T11:39:41.2503954Z     sql = transform(self, expression)
2024-07-03T11:39:41.2504549Z           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2506141Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/transforms.py", line 647, in _to_sql
2024-07-03T11:39:41.2507542Z     return _sql_handler(expression)
2024-07-03T11:39:41.2508318Z            ^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.2509887Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 1089, in with_sql

...

2024-07-03T11:39:41.3934553Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 779, in sql
2024-07-03T11:39:41.3934744Z     sql = getattr(self, exp_handler_name)(expression)
2024-07-03T11:39:41.3934893Z           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.3935578Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 3139, in dpipe_sql
2024-07-03T11:39:41.3935728Z     return self.binary(expression, "||")
2024-07-03T11:39:41.3935862Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.3936524Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 3271, in binary
2024-07-03T11:39:41.3936891Z     return f"{self.sql(expression, 'this')} {op} {self.sql(expression, 'expression')}"
2024-07-03T11:39:41.3937027Z               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.3937658Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 768, in sql
2024-07-03T11:39:41.3937792Z     return self.sql(value)
2024-07-03T11:39:41.3937913Z            ^^^^^^^^^^^^^^^
2024-07-03T11:39:41.3938542Z   File "/home/runner/work/data-catalogue/data-catalogue/.venv/lib/python3.11/site-packages/sqlglot/generator.py", line 773, in sql
2024-07-03T11:39:41.3938689Z     if callable(transform):
2024-07-03T11:39:41.3938807Z        ^^^^^^^^^^^^^^^^^^^
2024-07-03T11:39:41.3939133Z RecursionError: maximum recursion depth exceeded while calling a Python object

Recursion errors may also be caught and reported at INFO level; for example:

2024-07-03T11:05:37.3320639Z [2024-07-03 11:05:37,331] INFO     {datahub.ingestion.source.dbt.dbt_common:1161} - Failed to parse compiled code for snapshot.mojap_derived_tables.derived_delius_dim__snapshot_sentences_at_disp: maximum recursion depth exceeded

We have seen this happen when calling sqlglot, which formats compiled SQL code we ingest from DBT.

Whether the recursion limit is reached depends on:

  • the SQL used in Create a Derived Table (only very long expressions will trigger the issue)
  • the compiled_code being output in the DBT manifest (it’s possible this depends on the exact DBT command that generated the manifest)
  • the runtime behaviour of Datahub (exactly when the recursion limit is reached depends on the runtime call stack)

This happens because sqlglot parses long expressions into a deeply nested tree; for example, the SQL expression a || a || a produces the following abstract syntax tree:

DPipe(
    this=DPipe(
        this=Column(
            this=Identifier(this=a, quoted=False)
        ),
        expression=Column(
            this=Identifier(this=a, quoted=False)
        ),
        safe=True
    ),
    expression=Column(
        this=Identifier(this=a, quoted=False)
    ),
    safe=True
)

sqlglot’s generator class relies on mutually recursive methods to produce SQL strings, so if the abstract syntax tree is too deep, it will hit recursion limits.

To mitigate this, we have switched off include_compiled_code and include_column_lineage for the DBT ingestion source.

This page was last reviewed on 16 July 2024. It needs to be reviewed again on 16 October 2024 by the page owner #data-catalogue .
This page was set to be reviewed before 16 October 2024 by the page owner #data-catalogue. This might mean the content is out of date.