Skip to content

[Feat] 스크립트 파일 생성 (#38)#99

Merged
shinae1023 merged 2 commits into
mainfrom
feat/#38-mock-jobposting-with-data
Jun 18, 2026
Merged

[Feat] 스크립트 파일 생성 (#38)#99
shinae1023 merged 2 commits into
mainfrom
feat/#38-mock-jobposting-with-data

Conversation

@shinae1023

@shinae1023 shinae1023 commented Jun 16, 2026

Copy link
Copy Markdown
Member

✨ 어떤 이유로 PR를 하셨나요?

  • feature 병합
  • 버그 수정(아래에 issue #를 남겨주세요)
  • 코드 개선
  • 코드 수정
  • 배포
  • 기타(아래에 자세한 내용 기입해주세요)

📋 세부 내용 - 왜 해당 PR이 필요한지 작업 내용을 자세하게 설명해주세요

📸 작업 화면 스크린샷

⚠️ PR하기 전에 확인해주세요

  • 로컬테스트를 진행하셨나요?
  • 머지할 브랜치를 확인하셨나요?
  • 관련 label을 선택하셨나요?

🚨 관련 이슈 번호 [ ]

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Python CLI script for importing job posting and question embedding corpus data from Excel workbooks to the database.
  • Documentation

    • Updated README with Corpus Import Script section documenting virtual environment setup, dependencies installation, execution commands, and database configuration.

@shinae1023 shinae1023 self-assigned this Jun 16, 2026
@shinae1023 shinae1023 added the ✨ feat New feature or request label Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds scripts/import_corpus.py, a Python CLI script that reads job posting and question corpus data from an XLSX workbook and upserts rows into Postgres using INSERT ... ON CONFLICT DO UPDATE. Includes a scripts/requirements-corpus-import.txt with openpyxl and psycopg[binary] dependencies, and a new README section documenting setup and execution.

Changes

Corpus XLSX-to-Postgres Import Script

Layer / File(s) Summary
Script setup, constants, config, and dependencies
scripts/requirements-corpus-import.txt, scripts/import_corpus.py
Declares openpyxl>=3.1.0 and psycopg[binary]>=3.1.0 requirements; defines sheet name constants, ImportStats dataclass, CLI argument parser, .env file loader, JDBC-to-DSN converter, and connect() with psycopg/psycopg2 fallback.
Value normalization and Excel parsing utilities
scripts/import_corpus.py
Implements normalize_text, to_string, to_int, and to_bool (with fixed boolean token sets), plus build_header_map, get_cell, require_headers, and a fetch_one helper normalizing cursor rows to dicts across both drivers.
DB resolution helpers and upsert operations
scripts/import_corpus.py
Adds resolve_company_id (cache-backed lookup/create), resolve_detail_classification_id (multi-fallback with ImportStats tracking), and upsert_job_posting/upsert_question that perform INSERT ... ON CONFLICT DO UPDATE into mock_job_posting_corpus and mock_question_corpus.
Sheet import routines, main orchestration, and README docs
scripts/import_corpus.py, README.md
Implements import_job_posting_sheet and import_question_sheet (header validation, row iteration, payload construction, stats increments) and main() (workbook open, DB connect, both sheet imports, resource cleanup, stats print); README documents the full setup and run procedure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hoppity-hop through each Excel row,
Upsert the corpus, let the data flow!
psycopg or two, whichever's near,
Cache the companies, classifications clear.
Stats all counted, the rabbit cheers—
"INSERT ON CONFLICT brings no tears!" 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely a blank template with unchecked checkboxes and no actual content filled in, failing to explain the purpose, functionality, or implementation details. Fill in the template with: reason for PR (feature), detailed explanation of the corpus import functionality, any relevant screenshots, confirmation of local testing, target branch, and issue number #38.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[Feat] 스크립트 파일 생성 (#38)' is vague and generic, merely stating 'script file creation' without identifying which scripts or their purpose. Use a more specific title that describes the actual script purpose, e.g., '[Feat] Add corpus import script for bulk Excel data loading' or similar.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/#38-mock-jobposting-with-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/import_corpus.py`:
- Line 502: The load_workbook function call on line 502 is loading the entire
XLSX file into memory, which causes excessive memory consumption for large
corpus files. Add the read_only parameter set to True in the load_workbook
function call to enable read-only mode, which will reduce memory pressure and
improve performance when importing large XLSX files.
- Around line 109-114: The to_int() function is silently truncating decimal
values by converting through float first and then to int, which corrupts data
like char_limit. Modify the to_int() function to validate that the numeric value
is actually an integer before conversion: after converting text to float, check
if the float value equals its integer conversion (using modulo or equality
check), and if not, raise an error or return None instead of silently truncating
with int(float(text)).
- Around line 512-522: The current code silently converts missing required
sheets to None when they are not found in the workbook (checking JD_SHEET_NAME
and QUESTION_SHEET_NAME in workbook.sheetnames), allowing the import functions
to skip processing without any error indication. Replace the conditional sheet
access with explicit validation that raises an error if the required sheets are
missing, ensuring the import fails fast instead of reporting success when
nothing was actually imported. Check for sheet existence before calling
import_job_posting_sheet and import_question_sheet functions, and raise an
appropriate exception if either required sheet is absent from the workbook.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 71f6a4d1-a711-4972-9506-9340f61250b6

📥 Commits

Reviewing files that changed from the base of the PR and between 2ad1c08 and 3dc6dbc.

📒 Files selected for processing (3)
  • README.md
  • scripts/import_corpus.py
  • scripts/requirements-corpus-import.txt

Comment thread scripts/import_corpus.py
Comment on lines +109 to +114
def to_int(value: Any) -> int | None:
text = to_string(value)
if text is None:
return None
return int(float(text))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject non-integer numeric inputs in to_int() instead of truncating.

Line 113 currently does int(float(text)), so values like "12.9" become 12 silently. That can corrupt persisted char_limit values.

Proposed fix
 def to_int(value: Any) -> int | None:
-    text = to_string(value)
-    if text is None:
-        return None
-    return int(float(text))
+    text = to_string(value)
+    if text is None:
+        return None
+    number = float(text)
+    if not number.is_integer():
+        raise ValueError(f"Expected integer-compatible value, got: {value}")
+    return int(number)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/import_corpus.py` around lines 109 - 114, The to_int() function is
silently truncating decimal values by converting through float first and then to
int, which corrupts data like char_limit. Modify the to_int() function to
validate that the numeric value is actually an integer before conversion: after
converting text to float, check if the float value equals its integer conversion
(using modulo or equality check), and if not, raise an error or return None
instead of silently truncating with int(float(text)).

Comment thread scripts/import_corpus.py
raise SystemExit(f"XLSX file not found: {xlsx_path}")

load_env_file(Path(args.env_file))
workbook = load_workbook(filename=xlsx_path, data_only=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Open the workbook in read-only mode for large corpus files.

Line 502 loads the entire workbook into memory. For large XLSX imports, this can cause unnecessary memory pressure and slower startup.

Proposed fix
-    workbook = load_workbook(filename=xlsx_path, data_only=True)
+    workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
workbook = load_workbook(filename=xlsx_path, data_only=True)
workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/import_corpus.py` at line 502, The load_workbook function call on
line 502 is loading the entire XLSX file into memory, which causes excessive
memory consumption for large corpus files. Add the read_only parameter set to
True in the load_workbook function call to enable read-only mode, which will
reduce memory pressure and improve performance when importing large XLSX files.

Comment thread scripts/import_corpus.py
Comment on lines +512 to +522
import_job_posting_sheet(
cur,
workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None,
stats,
company_cache,
classification_cache,
)
import_question_sheet(
cur,
workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None,
stats,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when required sheets are missing instead of silently skipping.

At Line 514 and Line 521, missing sheets are converted to None, and the import functions return early. This can report successful completion while importing nothing.

Proposed fix
     load_env_file(Path(args.env_file))
     workbook = load_workbook(filename=xlsx_path, data_only=True)
+    missing_sheets = [
+        name for name in (JD_SHEET_NAME, QUESTION_SHEET_NAME)
+        if name not in workbook.sheetnames
+    ]
+    if missing_sheets:
+        raise SystemExit(f"Missing required sheet(s): {', '.join(missing_sheets)}")

@@
                 import_job_posting_sheet(
                     cur,
-                    workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None,
+                    workbook[JD_SHEET_NAME],
                     stats,
                     company_cache,
                     classification_cache,
                 )
                 import_question_sheet(
                     cur,
-                    workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None,
+                    workbook[QUESTION_SHEET_NAME],
                     stats,
                     company_cache,
                     classification_cache,
                 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/import_corpus.py` around lines 512 - 522, The current code silently
converts missing required sheets to None when they are not found in the workbook
(checking JD_SHEET_NAME and QUESTION_SHEET_NAME in workbook.sheetnames),
allowing the import functions to skip processing without any error indication.
Replace the conditional sheet access with explicit validation that raises an
error if the required sheets are missing, ensuring the import fails fast instead
of reporting success when nothing was actually imported. Check for sheet
existence before calling import_job_posting_sheet and import_question_sheet
functions, and raise an appropriate exception if either required sheet is absent
from the workbook.

@shinae1023 shinae1023 merged commit 371b777 into main Jun 18, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

✨ feat New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant