I’m building a job aggregator with a live data platform that provides in-depth market analysis. I’m currently focused on improving how I extract skills from job postings. While my current extraction setup achieves ~90% accuracy, it struggles with edge cases and lacks flexibility, particularly when skills are phrased in unexpected ways.
1. The Problem:
1.1: Lack of flexibility: The system only captures predefined phrases. If a job post says something like "proficiency in spreadsheets" or "experience with advanced reporting tools", it misses that Excel is likely required.
1.2: Manual maintenance: Constantly updating JSON files to account for new variations is tedious and unsustainable as the project grows.
2. Current Setup:
2.1: Keyword-based extraction: I maintain a JSON file with predefined skill variations. Example:
    "programming_languages": {
        "JavaScript": ["javascript", "js" ...],
        ...
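
To make the current setup concrete, here is a minimal sketch of how this kind of keyword lookup works. It is not my production code: the `SKILLS` dict is a small stand-in for the JSON file, the `Excel` entry and the `extract_skills` name are hypothetical, and the word-boundary regex is just one reasonable way to do the matching.

```python
import re

# Illustrative stand-in for the JSON file. Only the JavaScript entry mirrors
# the example above; the Excel entry is a hypothetical addition.
SKILLS = {
    "programming_languages": {
        "JavaScript": ["javascript", "js"],
    },
    "tools": {
        "Excel": ["excel", "ms excel"],
    },
}


def extract_skills(text, skills=SKILLS):
    """Return the canonical skill names whose listed variations occur in text."""
    lowered = text.lower()
    found = set()
    for category in skills.values():
        for canonical, variations in category.items():
            for variation in variations:
                # Require non-word characters on both sides so that "js"
                # does not match inside "json" and "excel" not inside "excellent".
                pattern = r"(?<!\w)" + re.escape(variation) + r"(?!\w)"
                if re.search(pattern, lowered):
                    found.add(canonical)
                    break
    return found


print(extract_skills("Strong JS skills and excellent JSON knowledge"))
print(extract_skills("Proficiency in spreadsheets required"))
```

The second call comes back empty, which is exactly the limitation from 1.1: nothing in the variation list connects "spreadsheets" to Excel.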
3. Constraints:
3.1: Lightweight: I’m avoiding heavy ML models or resource-intensive pipelines to keep server costs low.
3.2: Flexible: I need a solution that better handles synonyms, context, and unexpected phrasing with minimal manual input.
3.3: Free or open-source: Ideally, something I can plug into my existing server setup without added costs.
4. My Questions:
4.1: How can I improve this process to make it more robust and context-aware?
4.2: Are there lightweight tools, heuristics, or libraries you’d recommend for handling variations and semantic similarity?
4.3: Would pre-trained embeddings (e.g., GloVe, FastText) or other lightweight NLP methods help here?
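
To illustrate the kind of lightweight heuristic I mean in 4.2, here is a rough sketch using fuzzy string matching from Python's standard-library `difflib`. The `VARIATION_TO_SKILL` mapping, the `fuzzy_skill` name, and the 0.85 cutoff are all hypothetical choices for illustration, not something I have tuned.

```python
from difflib import get_close_matches

# Hypothetical flat mapping from a known variation to its canonical skill name.
VARIATION_TO_SKILL = {
    "javascript": "JavaScript",
    "js": "JavaScript",
    "excel": "Excel",
}


def fuzzy_skill(token, cutoff=0.85):
    """Map a token to a canonical skill if it is close to a known variation."""
    matches = get_close_matches(
        token.lower(), list(VARIATION_TO_SKILL), n=1, cutoff=cutoff
    )
    # get_close_matches returns an empty list when nothing clears the cutoff.
    return VARIATION_TO_SKILL[matches[0]] if matches else None


print(fuzzy_skill("javascrpt"))     # misspelling of a known variation
print(fuzzy_skill("spreadsheets"))  # semantically related, but not close as a string
```

This would catch misspellings and minor spelling variants without extra dependencies, but it still cannot link "spreadsheets" to Excel, because the two are related in meaning rather than in spelling. That gap is what the embedding question in 4.3 is about.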
I’d love to hear from anyone who’s tackled similar challenges in NLP or information extraction. Any suggestions on balancing accuracy, flexibility, and computational efficiency would be greatly appreciated!
If you’re curious what my current market analysis looks like, here is a link: https://careercode.it/market