Synthetic Question Generator
This application generates a set of synthetic questions from documents stored in Google Cloud Storage (GCS) and saves them to a local CSV file. For each document, it generates one question for each predefined question type (Factual, Summarization, etc.).
The output CSV is structured for easy uploading to a BigQuery table with the following schema: input (STRING), expected_output (STRING), source (STRING), type (STRING).
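As a rough illustration of the output shape described above, the sketch below builds one row per document and question type and writes the rows with the four schema columns. It is not the application's actual code: `build_rows`, `write_csv`, and `placeholder_question` are hypothetical names, the placeholder stands in for whatever model call produces the real question and answer, and `QUESTION_TYPES` lists only the two types named in this README.

```python
import csv
import io

# Column names taken from the BigQuery schema described above.
FIELDNAMES = ["input", "expected_output", "source", "type"]

# Assumption: only the two question types named in this README are listed here;
# the application defines more.
QUESTION_TYPES = ["Factual", "Summarization"]


def placeholder_question(source, qtype):
    """Hypothetical stand-in for the model call that writes a question and answer."""
    return (f"{qtype} question about {source}", f"{qtype} answer for {source}")


def build_rows(sources, make_question):
    """Return one row per (document, question type) pair."""
    rows = []
    for source in sources:
        for qtype in QUESTION_TYPES:
            question, answer = make_question(source, qtype)
            rows.append(
                {
                    "input": question,
                    "expected_output": answer,
                    "source": source,
                    "type": qtype,
                }
            )
    return rows


def write_csv(rows, handle):
    """Write a header line followed by one CSV line per row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(build_rows(["documents/processed/a.md"], placeholder_question), buffer)
    print(buffer.getvalue())
```

Keeping the header names identical to the BigQuery column names means the file can be loaded without a column-mapping step.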
Usage
The script is run from the command line. You need to provide the path to the source documents within your GCS bucket and a path for the output CSV file.
Command
uv run python -m synth_gen.main [OPTIONS] GCS_PATH
Arguments
GCS_PATH: (Required) The path to the directory in your GCS bucket where the source markdown files are located (e.g., documents/markdown/).
--output-csv, -o: (Required) The local file path where the generated questions will be saved in CSV format.
Example
uv run python -m synth_gen.main documents/processed/ --output-csv synthetic_questions.csv
This command will fetch all documents from the gs://<your-bucket-name>/documents/processed/ directory, generate questions for each, and save them to a file named synthetic_questions.csv in the current directory.
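To make the mapping from the GCS_PATH argument to the full bucket location concrete, here is a minimal sketch. The helper names `gcs_prefix` and `gcs_uri` are hypothetical, and this README does not say how the bucket name is configured, so the sketch simply takes it as a parameter. Normalizing the path to end with a single slash is also an assumption, made so that a prefix such as documents/processed does not also match a sibling directory with a longer name.

```python
def gcs_prefix(gcs_path: str) -> str:
    """Normalize a bucket-relative directory path to 'dir/subdir/' form."""
    cleaned = gcs_path.strip("/")
    # An empty path means the bucket root, which needs no trailing slash.
    return f"{cleaned}/" if cleaned else ""


def gcs_uri(bucket_name: str, gcs_path: str) -> str:
    """Build the gs:// URI for a directory inside the given bucket."""
    return f"gs://{bucket_name}/{gcs_prefix(gcs_path)}"


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(gcs_uri("my-bucket", "documents/processed/"))
```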