[Feat] 스크립트 파일 생성 (#38) by shinae1023 · Pull Request #99 · JobDri-Developer/BackEnd

shinae1023 · 2026-06-16T12:51:30Z

✨ 어떤 이유로 PR를 하셨나요?

📋 세부 내용 - 왜 해당 PR이 필요한지 작업 내용을 자세하게 설명해주세요

📸 작업 화면 스크린샷

⚠️ PR하기 전에 확인해주세요

로컬테스트를 진행하셨나요?
머지할 브랜치를 확인하셨나요?
관련 label을 선택하셨나요?

🚨 관련 이슈 번호 [ ]

Summary by CodeRabbit

Release Notes

New Features
- Added Python CLI script for importing job posting and question embedding corpus data from Excel workbooks to the database.
Documentation
- Updated README with Corpus Import Script section documenting virtual environment setup, dependencies installation, execution commands, and database configuration.

coderabbitai · 2026-06-16T12:52:06Z

📝 Walkthrough

Walkthrough

Adds scripts/import_corpus.py, a Python CLI script that reads job posting and question corpus data from an XLSX workbook and upserts rows into Postgres using INSERT ... ON CONFLICT DO UPDATE. Includes a scripts/requirements-corpus-import.txt with openpyxl and psycopg[binary] dependencies, and a new README section documenting setup and execution.

Changes

Corpus XLSX-to-Postgres Import Script

Layer / File(s)	Summary
Script setup, constants, config, and dependencies `scripts/requirements-corpus-import.txt`, `scripts/import_corpus.py`	Declares `openpyxl>=3.1.0` and `psycopg[binary]>=3.1.0` requirements; defines sheet name constants, `ImportStats` dataclass, CLI argument parser, `.env` file loader, JDBC-to-DSN converter, and `connect()` with `psycopg`/`psycopg2` fallback.
Value normalization and Excel parsing utilities `scripts/import_corpus.py`	Implements `normalize_text`, `to_string`, `to_int`, and `to_bool` (with fixed boolean token sets), plus `build_header_map`, `get_cell`, `require_headers`, and a `fetch_one` helper normalizing cursor rows to dicts across both drivers.
DB resolution helpers and upsert operations `scripts/import_corpus.py`	Adds `resolve_company_id` (cache-backed lookup/create), `resolve_detail_classification_id` (multi-fallback with `ImportStats` tracking), and `upsert_job_posting`/`upsert_question` that perform `INSERT ... ON CONFLICT DO UPDATE` into `mock_job_posting_corpus` and `mock_question_corpus`.
Sheet import routines, main orchestration, and README docs `scripts/import_corpus.py`, `README.md`	Implements `import_job_posting_sheet` and `import_question_sheet` (header validation, row iteration, payload construction, stats increments) and `main()` (workbook open, DB connect, both sheet imports, resource cleanup, stats print); README documents the full setup and run procedure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through each Excel row,
Upsert the corpus, let the data flow!
psycopg or two, whichever's near,
Cache the companies, classifications clear.
Stats all counted, the rabbit cheers—
"INSERT ON CONFLICT brings no tears!" 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is entirely a blank template with unchecked checkboxes and no actual content filled in, failing to explain the purpose, functionality, or implementation details.	Fill in the template with: reason for PR (feature), detailed explanation of the corpus import functionality, any relevant screenshots, confirmation of local testing, target branch, and issue number `#38`.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title '[Feat] 스크립트 파일 생성 (`#38`)' is vague and generic, merely stating 'script file creation' without identifying which scripts or their purpose.	Use a more specific title that describes the actual script purpose, e.g., '[Feat] Add corpus import script for bulk Excel data loading' or similar.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/#38-mock-jobposting-with-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/import_corpus.py`:
- Line 502: The load_workbook function call on line 502 is loading the entire
XLSX file into memory, which causes excessive memory consumption for large
corpus files. Add the read_only parameter set to True in the load_workbook
function call to enable read-only mode, which will reduce memory pressure and
improve performance when importing large XLSX files.
- Around line 109-114: The to_int() function is silently truncating decimal
values by converting through float first and then to int, which corrupts data
like char_limit. Modify the to_int() function to validate that the numeric value
is actually an integer before conversion: after converting text to float, check
if the float value equals its integer conversion (using modulo or equality
check), and if not, raise an error or return None instead of silently truncating
with int(float(text)).
- Around line 512-522: The current code silently converts missing required
sheets to None when they are not found in the workbook (checking JD_SHEET_NAME
and QUESTION_SHEET_NAME in workbook.sheetnames), allowing the import functions
to skip processing without any error indication. Replace the conditional sheet
access with explicit validation that raises an error if the required sheets are
missing, ensuring the import fails fast instead of reporting success when
nothing was actually imported. Check for sheet existence before calling
import_job_posting_sheet and import_question_sheet functions, and raise an
appropriate exception if either required sheet is absent from the workbook.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 71f6a4d1-a711-4972-9506-9340f61250b6

📥 Commits

Reviewing files that changed from the base of the PR and between 2ad1c08 and 3dc6dbc.

📒 Files selected for processing (3)

README.md
scripts/import_corpus.py
scripts/requirements-corpus-import.txt

coderabbitai · 2026-06-18T02:34:59Z

+def to_int(value: Any) -> int | None:
+    text = to_string(value)
+    if text is None:
+        return None
+    return int(float(text))
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-integer numeric inputs in to_int() instead of truncating.

Line 113 currently does int(float(text)), so values like "12.9" become 12 silently. That can corrupt persisted char_limit values.

Proposed fix

def to_int(value: Any) -> int | None: - text = to_string(value) - if text is None: - return None - return int(float(text)) + text = to_string(value) + if text is None: + return None + number = float(text) + if not number.is_integer(): + raise ValueError(f"Expected integer-compatible value, got: {value}") + return int(number)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/import_corpus.py` around lines 109 - 114, The to_int() function is silently truncating decimal values by converting through float first and then to int, which corrupts data like char_limit. Modify the to_int() function to validate that the numeric value is actually an integer before conversion: after converting text to float, check if the float value equals its integer conversion (using modulo or equality check), and if not, raise an error or return None instead of silently truncating with int(float(text)).

coderabbitai · 2026-06-18T02:35:00Z

+        raise SystemExit(f"XLSX file not found: {xlsx_path}")
+
+    load_env_file(Path(args.env_file))
+    workbook = load_workbook(filename=xlsx_path, data_only=True)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Open the workbook in read-only mode for large corpus files.

Line 502 loads the entire workbook into memory. For large XLSX imports, this can cause unnecessary memory pressure and slower startup.

Proposed fix

- workbook = load_workbook(filename=xlsx_path, data_only=True) + workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

workbook = load_workbook(filename=xlsx_path, data_only=True)

workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/import_corpus.py` at line 502, The load_workbook function call on line 502 is loading the entire XLSX file into memory, which causes excessive memory consumption for large corpus files. Add the read_only parameter set to True in the load_workbook function call to enable read-only mode, which will reduce memory pressure and improve performance when importing large XLSX files.

coderabbitai · 2026-06-18T02:35:00Z

+                import_job_posting_sheet(
+                    cur,
+                    workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None,
+                    stats,
+                    company_cache,
+                    classification_cache,
+                )
+                import_question_sheet(
+                    cur,
+                    workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None,
+                    stats,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when required sheets are missing instead of silently skipping.

At Line 514 and Line 521, missing sheets are converted to None, and the import functions return early. This can report successful completion while importing nothing.

Proposed fix

load_env_file(Path(args.env_file)) workbook = load_workbook(filename=xlsx_path, data_only=True) + missing_sheets = [ + name for name in (JD_SHEET_NAME, QUESTION_SHEET_NAME) + if name not in workbook.sheetnames + ] + if missing_sheets: + raise SystemExit(f"Missing required sheet(s): {', '.join(missing_sheets)}") @@ import_job_posting_sheet( cur, - workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None, + workbook[JD_SHEET_NAME], stats, company_cache, classification_cache, ) import_question_sheet( cur, - workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None, + workbook[QUESTION_SHEET_NAME], stats, company_cache, classification_cache, )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/import_corpus.py` around lines 512 - 522, The current code silently converts missing required sheets to None when they are not found in the workbook (checking JD_SHEET_NAME and QUESTION_SHEET_NAME in workbook.sheetnames), allowing the import functions to skip processing without any error indication. Replace the conditional sheet access with explicit validation that raises an error if the required sheets are missing, ensuring the import fails fast instead of reporting success when nothing was actually imported. Check for sheet existence before calling import_job_posting_sheet and import_question_sheet functions, and raise an appropriate exception if either required sheet is absent from the workbook.

[Feat] 스크립트 파일 생성 (#38)

105534b

shinae1023 self-assigned this Jun 16, 2026

shinae1023 added the ✨ feat New feature or request label Jun 16, 2026

[Chore] 불필요 파일 삭제

3dc6dbc

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

shinae1023 merged commit 371b777 into main Jun 18, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feat] 스크립트 파일 생성 (#38)#99

[Feat] 스크립트 파일 생성 (#38)#99
shinae1023 merged 2 commits into
mainfrom
feat/#38-mock-jobposting-with-data

shinae1023 commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Uh oh!

coderabbitai Bot Jun 18, 2026

Uh oh!

coderabbitai Bot Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	workbook = load_workbook(filename=xlsx_path, data_only=True)
	workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)

Uh oh!

Conversation

shinae1023 commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✨ 어떤 이유로 PR를 하셨나요?

📋 세부 내용 - 왜 해당 PR이 필요한지 작업 내용을 자세하게 설명해주세요

📸 작업 화면 스크린샷

⚠️ PR하기 전에 확인해주세요

🚨 관련 이슈 번호 [ ]

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shinae1023 commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading