The Resume Parser for Extracting Information with SpaCy’s Magic

Sanket Sarwade 12 Jul, 2023 • 12 min read

Introduction

Resume parsing simplifies and streamlines hiring, and it has become essential for busy hiring managers and human resources professionals. By automating the initial screening of resumes with spaCy, a resume parser acts as a smart assistant, applying natural language processing techniques to extract key details such as contact information, education history, work experience, and skills.

This structured data allows recruiters to efficiently evaluate candidates, search for specific qualifications, and integrate the parsing technology with applicant tracking systems or recruitment software. By saving time, reducing errors, and supporting informed decisions, resume parsing speeds up screening and improves the overall recruitment experience.

Check out the GitHub repository here.

Learning Objectives

Before we dive into the technical details, let’s outline the learning objectives of this guide:

Understand the concept of resume parsing and its importance in the recruitment process.
Learn how to set up the development environment for building a resume parser using spaCy.
Explore techniques to extract text from resumes in different formats.
Implement methods to extract contact information, including phone numbers and email addresses, from resume text.
Develop skills to identify and extract relevant skills mentioned in resumes.
Gain knowledge on extracting educational qualifications from resumes.
Utilize spaCy and its matcher to extract the candidate’s name from resume text.
Apply the learned concepts to parse a sample resume and extract essential information.
Appreciate the significance of automating the resume parsing process for efficient recruitment.

Now, let’s delve into each section of the guide and understand how to accomplish these objectives.

This article was published as a part of the Data Science Blogathon.

Introduction
What is SpaCy?
Setting up the Development Environment
Extracting Text from Resumes
Extracting Contact Information
Extracting Email Address
Extracting Skills
Extracting Education
Extracting Name Using spaCy
Parsing a Sample Resume
Challenges in Resume Parser Development
Conclusion
Frequently Asked Questions

What is SpaCy?

spaCy is an open-source Python library for natural language processing (NLP), and it is well suited to resume parsing. It offers pre-trained pipelines for tasks like named entity recognition (NER) and part-of-speech (POS) tagging, which help extract and categorize information from resumes. Its rule-based matching, customization options, speed, and ease of use make it a common choice for this kind of work.

By using spaCy for resume parsing, recruiters can save time and effort by automating the extraction of key details from resumes. Automated extraction reduces manual data-entry errors and gives more consistent results, which improves the quality of candidate screening. spaCy’s NLP capabilities also support more detailed analysis, providing context that helps recruiters make informed assessments.

Another advantage of spaCy is that it works alongside other libraries and frameworks, such as scikit-learn and TensorFlow. This makes it possible to add machine learning models and more extensive data processing on top of the parser.

In summary, spaCy is widely used for resume parsing because its pre-trained models and rule-based matching make it practical to automate the initial screening of candidates, saving time, reducing errors, and enabling deeper analysis.

Note: I have developed a resume parser using two distinct approaches. The first method, available on my GitHub account, offers a straightforward approach. In the second method, I leveraged the remarkable capabilities of spaCy, an exceptional natural language processing library. Through this integration, I have enhanced the resume parsing process, effortlessly extracting valuable information from resumes.

Here is the complete code from GitHub.

Setting up the Development Environment

Before we can start building our resume parser, we need to set up our development environment. Here are the steps to get started:

Install Python: Make sure Python is installed on your system. You can download the latest version of Python from the official Python website (https://www.python.org) and follow the installation instructions for your operating system.
Install spaCy: Open a command prompt or terminal and use the following command to install spaCy:

pip install spacy

Download spaCy’s English Language Model: spaCy provides pre-trained models for different languages. We’ll be using the English language model for our resume parser. Download the English language model by running the following command:

python -m spacy download en_core_web_sm

Install additional libraries: We’ll be using the pdfminer.six library to extract text from PDF resumes. Install it using the following command:

pip install pdfminer.six

Once you have completed these steps, your development environment will be ready for building the resume parser.

Extracting Text from Resumes

The first step in resume parsing is to extract the text from resumes in various formats, such as PDF or Word documents. We’ll be using the pdfminer.six library to extract text from PDF resumes. Here’s a function that takes a PDF file path as input and returns the extracted text:

import re
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

You can call this function with the path to your PDF resume and obtain the extracted text.
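The function above covers PDFs only. As a minimal sketch of how other formats could be handled, the dispatcher below picks an extractor by file extension. The function name extract_text_from_file and the use of the python-docx package for .docx files are assumptions for illustration, not part of the original parser. Third-party imports are deferred so each one is only needed for its own format.

```python
from pathlib import Path


def extract_text_from_file(path):
    """Return the plain text of a resume, choosing an extractor by extension."""
    suffix = Path(path).suffix.lower()

    if suffix == ".pdf":
        # pdfminer.six, as used elsewhere in this article
        from pdfminer.high_level import extract_text
        return extract_text(str(path))

    if suffix == ".docx":
        # Assumption: python-docx is installed (pip install python-docx)
        import docx
        document = docx.Document(str(path))
        return "\n".join(paragraph.text for paragraph in document.paragraphs)

    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")

    raise ValueError(f"Unsupported resume format: {suffix or 'no extension'}")
```

Raising an error for unknown extensions keeps unsupported files from silently producing empty text.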

Extracting Contact Information

Contact information, including phone numbers, email addresses, and physical addresses, is crucial for reaching out to potential candidates. Extracting this information accurately is an essential part of resume parsing. We can use regular expressions to match patterns and extract contact information.

Function to Extract

Let’s define a function to extract a contact number from the resume text:

import re

def extract_contact_number_from_resume(text):
    contact_number = None

    # Use regex pattern to find a potential contact number
    pattern = r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
    match = re.search(pattern, text)
    if match:
        contact_number = match.group()

    return contact_number

We define a regex pattern to match the contact number format we’re looking for. The pattern r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b" is used in this case.

Pattern Components

Here’s a breakdown of the pattern components:

\b: Matches a word boundary to ensure the number is not part of a larger word.
(?:\+?\d{1,3}[-.\s]?)?: Matches an optional country code (e.g., +1 or +91) followed by an optional separator (-, ., or space).
\(?: Matches an optional opening parenthesis for the area code.
\d{3}: Matches exactly three digits for the area code.
\)?: Matches an optional closing parenthesis for the area code.
[-.\s]?: Matches an optional separator between the area code and the next part of the number.
\d{3}: Matches exactly three digits for the next part of the number.
[-.\s]?: Matches an optional separator between the next part of the number and the final part.
\d{4}: Matches exactly four digits for the final part of the number.
\b: Matches a word boundary to ensure the number is not part of a larger word.
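To see how these components behave together, here is a small sketch that runs the same pattern over a few sample strings. The helper name find_phone and the sample strings are made up for illustration.

```python
import re

# Same pattern as in extract_contact_number_from_resume above
PHONE_PATTERN = re.compile(
    r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
)


def find_phone(text):
    """Return the first substring matching the phone pattern, or None."""
    match = PHONE_PATTERN.search(text)
    return match.group() if match else None


samples = [
    "Call 123-456-7890 today",
    "+1 (555) 123-4567",
    "(555) 123-4567",
    "Ext. 12345",
]
for sample in samples:
    print(repr(sample), "->", repr(find_phone(sample)))
```

On these samples, a leading + or ( is not included in the returned match, because \b only matches next to a letter, digit, or underscore. If the prefix matters, one option is to reduce the match to digits afterwards and handle the country code separately.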

The provided regex pattern is designed to match a common format for contact numbers. However, it’s important to note that contact number formats can vary across different countries and regions. The pattern provided is a general pattern that covers common formats, but it may not capture all possible variations.

If you are parsing resumes from specific regions or countries, it’s recommended to customize the regex pattern to match the specific contact number formats used in those regions. You may need to consider country codes, area codes, separators, and number length variations.

It’s also worth mentioning that phone number formats can change over time, so it’s a good practice to periodically review and update the regex pattern to ensure it remains accurate.
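As one example of such customization, the sketch below targets numbers written with the +91 country code. It assumes a 10-digit national number with an optional space or hyphen after the country code; the names INDIA_PHONE_PATTERN and extract_indian_number are hypothetical.

```python
import re

# Assumption: +91 followed by exactly 10 digits, with an optional
# space or hyphen after the country code.
INDIA_PHONE_PATTERN = re.compile(r"(?<!\d)\+91[-\s]?\d{10}(?!\d)")


def extract_indian_number(text):
    """Return the first +91 number as digits only, or None if absent."""
    match = INDIA_PHONE_PATTERN.search(text)
    if not match:
        return None
    # Drop the plus sign and separators so numbers compare consistently
    return re.sub(r"\D", "", match.group())
```

The lookarounds (?<!\d) and (?!\d) are used instead of \b so that the leading + stays in the match and longer digit runs are rejected.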

Find Contact Number with Country Code

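# Another method: a spaCy Matcher pattern for a contact number with the
# +91 country code. Each dict describes one token, so check how your
# tokenizer splits "+91" and adjust the token dicts if needed.

pattern = [
    {"ORTH": "+"},
    {"ORTH": "91"},
    {"SHAPE": "dddddddddd"}
]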

For more information, go through spaCy’s documentation.

At the end of the article, we discuss some common problems that come up when writing the different parts of a resume parser.

Extracting Email Address

In addition to the contact number, extracting the email address is vital for communication with candidates. We can again use regular expressions to match patterns and extract the email address. Here’s a function to extract the email address from the resume text:

import re

def extract_email_from_resume(text):
    email = None

    # Use regex pattern to find a potential email address
    pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
    match = re.search(pattern, text)
    if match:
        email = match.group()

    return email

The regex pattern used in this code is r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b". Let’s break down the pattern:

\b: Represents a word boundary to ensure that the email address is not part of a larger word.
[A-Za-z0-9._%+-]+: Matches one or more occurrences of alphabetic characters (both uppercase and lowercase), digits, periods, underscores, percent signs, plus signs, or hyphens. This part represents the local part of the email address before the “@” symbol.
@: Matches the “@” symbol.
[A-Za-z0-9.-]+: Matches one or more occurrences of alphabetic characters (both uppercase and lowercase), digits, periods, or hyphens. This part represents the domain name (e.g., gmail, yahoo) of the email address.
\.: Matches a period (dot) character.
[A-Za-z]{2,}: Matches two or more occurrences of alphabetic characters (both uppercase and lowercase). This part represents the top-level domain (e.g., com, edu) of the email address.
\b: Represents another word boundary to ensure the email address is not part of a larger word.

#Alternative code

def extract_email_from_resume(text):
    email = None

    # Split the text into words
    words = text.split()

    # Iterate through the words and check for a potential email address
    for word in words:
        if "@" in word:
            email = word.strip()
            break

    return email

While the alternative code is simpler to understand for beginners, it may not handle more complex email address formats or consider email addresses separated by special characters. The initial code with the regex pattern provides a more comprehensive approach to identify potential email addresses based on common conventions.
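The difference shows up when an address is glued to punctuation. The sketch below reuses the regex to collect every address in the text rather than only the first; the function name extract_all_emails and the sample addresses are illustrative assumptions.

```python
import re

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_all_emails(text):
    """Return every distinct email address, lowercased, in order of appearance."""
    emails = []
    for match in EMAIL_PATTERN.finditer(text):
        email = match.group().lower()
        if email not in emails:
            emails.append(email)
    return emails


sample = "Email:jane.doe@example.com, alt: (JD@Example.org)"
print(extract_all_emails(sample))
# The split-based version would return the whole word containing "@":
print([word for word in sample.split() if "@" in word][0])
```

Here the regex isolates the address itself, while the split-based approach keeps the surrounding label and comma attached.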

Extracting Skills

Identifying the skills mentioned in a resume is crucial for determining the candidate’s qualifications. We can create a list of relevant skills and match them against the resume text to extract the mentioned skills. Let’s define a function to extract skills from the resume text:

import re

def extract_skills_from_resume(text, skills_list):
    skills = []

    for skill in skills_list:
        pattern = r"\b{}\b".format(re.escape(skill))
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            skills.append(skill)

    return skills

Here’s a breakdown of the code and its pattern:

The function takes two parameters: text (the resume text) and skills_list (a list of skills to search for).
It initializes an empty list skills to store the extracted skills.
It iterates through each skill in the skills_list.
Inside the loop, a regex pattern is constructed using re.escape(skill) to escape any special characters present in the skill. This ensures that characters such as + or . in a skill name are matched literally rather than treated as regex syntax.
The pattern is enclosed between \b word boundaries. This ensures that the skill is not part of a larger word and is treated as a separate entity.
The re.IGNORECASE flag is used with re.search() to perform a case-insensitive search. This allows matching skills regardless of their case (e.g., “Python” or “python”).
The re.search() function is used to search for the pattern within the resume text.
If a match is found, indicating the presence of the skill in the resume, the skill is appended to the skills list.
After iterating through all the skills in the skills_list, the function returns the extracted skills as a list.

Note: The regex pattern used in this code assumes that skills are represented as whole words and not as parts of larger words. It may not handle variations in skill representations or account for skills mentioned in a different format.
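One variation this affects is skills that end in a symbol, such as C++ or C#: \b needs a word character on one side, so it fails after a trailing + or #. A possible workaround, sketched below under the assumption that lookarounds suit your skill list, replaces \b with (?<!\w) and (?!\w). The function name extract_skills is hypothetical.

```python
import re


def extract_skills(text, skills_list):
    """Match skills as standalone terms, including ones ending in symbols."""
    found = []
    for skill in skills_list:
        # No word character directly before or after the skill
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(skill)
    return found


text = "Skills: C++, C#, JavaScript and SQL"
print(extract_skills(text, ["C++", "C#", "Java", "SQL"]))
```

"Java" is still rejected inside "JavaScript", as with \b, while "C++" and "C#" are now found.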

If you want to find some specific skills in a resume, this code will be useful.

if __name__ == '__main__':
    text = extract_text_from_pdf(pdf_path)

    # List of predefined skills
    skills_list = ['Python', 'Data Analysis', 'Machine Learning', 'Communication', 'Project Management', 'Deep Learning', 'SQL', 'Tableau']

    extracted_skills = extract_skills_from_resume(text, skills_list)

    if extracted_skills:
        print("Skills:", extracted_skills)
    else:
        print("No skills found")

Replace pdf_path with your file location. You can update skills_list as needed.

Extracting Education

Educational qualifications play a vital role in the recruitment process. We can match specific education keywords against the resume text to identify the candidate’s educational background. Here’s a function to extract education information from the resume text:

import re

def extract_education_from_resume(text):
    education = []

    # List of education keywords to match against
    education_keywords = ['Bsc', 'B. Pharmacy', 'B Pharmacy', 'Msc', 'M. Pharmacy', 'Ph.D', 'Bachelor', 'Master']

    for keyword in education_keywords:
        pattern = r"(?i)\b{}\b".format(re.escape(keyword))
        match = re.search(pattern, text)
        if match:
            education.append(match.group())

    return education

#Alternative Code:

import re
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

def extract_education_from_resume(text):
    education = []

    # Use regex pattern to find education information
    pattern = r"(?i)(?:Bsc|\bB\.\w+|\bM\.\w+|\bPh\.D\.\w+|\bBachelor(?:'s)?|\bMaster(?:'s)?|\bPh\.D)\s(?:\w+\s)*\w+"
    matches = re.findall(pattern, text)
    for match in matches:
        education.append(match.strip())

    return education

if __name__ == '__main__':
    text = extract_text_from_pdf(r"C:\Users\SANKET\Downloads\Untitled-resume.pdf")

    extracted_education = extract_education_from_resume(text)
    if extracted_education:
        print("Education:", extracted_education)
    else:
        print("No education information found")

#Note: Adjust the pattern to suit your requirements.
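As one example of adjusting the pattern, the sketch below keeps each match on a single line and stops at a comma or semicolon, so a degree phrase does not run into the next resume section. The degree keywords, DEGREE_PATTERN, and extract_degree_phrases are assumptions chosen for illustration, not the article's original pattern.

```python
import re

# Assumption: a degree phrase starts with one of these keywords and
# runs to the end of the line or the next comma/semicolon.
DEGREE_PATTERN = re.compile(
    r"\b(?:B\.?Sc|M\.?Sc|Ph\.?D|Bachelor(?:'s)?|Master(?:'s)?)\b[^\n,;]*",
    re.IGNORECASE,
)


def extract_degree_phrases(text):
    """Return degree phrases found in the text, one per match."""
    return [match.group().strip() for match in DEGREE_PATTERN.finditer(text)]


sample = (
    "Education: Bachelor of Science in Computer Science\n"
    "Experience: Software Engineer\n"
    "M.Sc Physics, 2019"
)
print(extract_degree_phrases(sample))
```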

Extracting Name Using spaCy

Identifying the candidate’s name from the resume is essential for personalization and identification. We can use spaCy and its pattern matching capabilities to extract the candidate’s name. Let’s define a function to extract the name using spaCy:

import spacy
from spacy.matcher import Matcher

def extract_name(resume_text):
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)

    # Define name patterns
    patterns = [
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}],  # First name and Last name
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}],  # First name, Middle name, and Last name
        [{'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}, {'POS': 'PROPN'}]  # First name, Middle name, Middle name, and Last name
        # Add more patterns as needed
    ]

    # Register all name patterns under a single match key
    matcher.add('NAME', patterns)

    doc = nlp(resume_text)
    matches = matcher(doc)

    for match_id, start, end in matches:
        span = doc[start:end]
        return span.text

    return None

#Alternative Method:

import re
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

def extract_name_from_resume(text):
    name = None

    # Use regex pattern to find a potential name
    pattern = r"(\b[A-Z][a-z]+\b)\s(\b[A-Z][a-z]+\b)"
    match = re.search(pattern, text)
    if match:
        name = match.group()

    return name

if __name__ == '__main__':
    text = extract_text_from_pdf(pdf_path)
    name = extract_name_from_resume(text)

    if name:
        print("Name:", name)
    else:
        print("Name not found")

The regex pattern r"(\b[A-Z][a-z]+\b)\s(\b[A-Z][a-z]+\b)" is used to find a potential name pattern in the resume text.

The pattern consists of two capturing groups separated by a whitespace match:

(\b[A-Z][a-z]+\b): This part matches a word starting with an uppercase letter followed by one or more lowercase letters. It represents the first name.
\s: This part matches a single whitespace character to separate the first and last names.
(\b[A-Z][a-z]+\b): This part matches a word starting with an uppercase letter followed by one or more lowercase letters. It represents the last name.

Replace pdf_path with your file path.
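Because this pattern matches any two capitalized words, it can also pick up headings such as "Contact Information". One possible refinement, sketched below, assumes the name appears alone on one of the first few lines of the resume. The function name extract_name_from_top_lines, the max_lines default, and the small heading stop-list are all hypothetical choices.

```python
import re

# Two to four capitalized words that make up the whole line
NAME_LINE_PATTERN = re.compile(r"^[A-Z][a-z]+(?:[ \t][A-Z][a-z]+){1,3}$")

# Hypothetical stop-list of headings that look like names
NON_NAME_HEADINGS = {"curriculum vitae", "contact information", "work experience"}


def extract_name_from_top_lines(text, max_lines=5):
    """Return the first name-like line among the top non-empty lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for line in lines[:max_lines]:
        if line.lower() in NON_NAME_HEADINGS:
            continue
        if NAME_LINE_PATTERN.match(line):
            return line
    return None


print(extract_name_from_top_lines("Curriculum Vitae\n\nJohn Doe\nSoftware Engineer"))
```

Like the original regex, this only handles names written as capitalized ASCII words, so it is a heuristic rather than a general solution.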

Parsing a Sample Resume

To put everything together, let’s create a sample resume and parse it using our resume parser functions. Here’s an example:

if __name__ == '__main__':
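    # The email address in this sample string appears as "[email protected]";
    # substitute a real address here to exercise extract_email_from_resume().
    resume_text = "John Doe\n\nContact Information: 123-456-7890, [email protected]\n\nSkills: Python, Data Analysis, Communication\n\nEducation: Bachelor of Science in Computer Science\n\nExperience: Software Engineer at XYZ Company"
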
    print("Resume:")
    print(resume_text)

    name = extract_name(resume_text)
    if name:
        print("Name:", name)
    else:
        print("Name not found")

    contact_number = extract_contact_number_from_resume(resume_text)
    if contact_number:
        print("Contact Number:", contact_number)
    else:
        print("Contact Number not found")

    email = extract_email_from_resume(resume_text)
    if email:
        print("Email:", email)
    else:
        print("Email not found")

    skills_list = ['Python', 'Data Analysis', 'Machine Learning', 'Communication']
    extracted_skills = extract_skills_from_resume(resume_text, skills_list)
    if extracted_skills:
        print("Skills:", extracted_skills)
    else:
        print("No skills found")

    extracted_education = extract_education_from_resume(resume_text)
    if extracted_education:
        print("Education:", extracted_education)
    else:
        print("No education information found")
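Printing each field works for a demo, but downstream tools such as an applicant tracking system usually expect structured data. The sketch below shows one way to collect results into a dictionary and serialize it as JSON. The parse_resume helper, the field names, and the sample address are assumptions, and the two extractors are compact stand-ins for the regex functions defined earlier.

```python
import json
import re

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_PATTERN = re.compile(
    r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"
)


def first_match(pattern, text):
    """Return the first match of a compiled pattern, or None."""
    match = pattern.search(text)
    return match.group() if match else None


def parse_resume(text, extractors):
    """Run each extractor on the text and store its result under the field name."""
    return {field: extractor(text) for field, extractor in extractors.items()}


extractors = {
    "contact_number": lambda text: first_match(PHONE_PATTERN, text),
    "email": lambda text: first_match(EMAIL_PATTERN, text),
}

sample = "John Doe\nContact: 123-456-7890, john.doe@example.com"
record = parse_resume(sample, extractors)
print(json.dumps(record, indent=2))
```

Passing the extractors as a dictionary makes it straightforward to add name, skills, or education functions without changing parse_resume.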

Challenges in Resume Parser Development

Developing a resume parser can be a complex task with several challenges along the way. Here are some common problems we encountered and suggestions for addressing them:

Accurate Text Extraction

One of the main challenges is extracting text accurately from resumes, especially when dealing with PDF formats. At times, the extraction process may distort or introduce errors in the extracted text, resulting in the retrieval of incorrect information. To overcome this, we need to use well-tested libraries or tools specifically designed for PDF text extraction, such as pdfminer.six, and check the extracted text before parsing it.

Dealing with Formatting Variations

Resumes come in various formats, layouts, and structures, making it difficult to extract information consistently. Some resumes may use tables, columns, or unconventional formatting, which can complicate the extraction process. To handle this, we need to consider these formatting variations and employ techniques like regular expressions or natural language processing to accurately extract the relevant information.
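A small normalization pass before pattern matching can reduce some of this variation. The sketch below is one possible cleanup step, assuming the extracted text may contain form feeds, tabs, and runs of spaces or blank lines; the function name normalize_resume_text is hypothetical.

```python
import re


def normalize_resume_text(text):
    """Tidy whitespace so later regex matching sees consistent spacing."""
    text = text.replace("\x0c", "\n")        # form feeds become newlines
    text = re.sub(r"[ \t]+", " ", text)       # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line
    return text.strip()


raw = "John   Doe \n\n\n\nSkills:\tPython\x0c"
print(repr(normalize_resume_text(raw)))
```

This does not recover the reading order of tables or multi-column layouts; it only makes spacing predictable for the extractors that follow.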

Extracting Names

Extracting the candidate’s name accurately can be a challenge, especially if the resume contains multiple names or complex name structures. Different cultures and naming conventions further add to the complexity. To address this, we can utilize approaches like named entity recognition (NER) using machine learning models or rule-based matching. However, it’s important to handle different naming conventions properly to ensure accurate extraction.

Contact Information Extraction

Extracting contact information such as phone numbers and email addresses can be prone to false positives or missing details. Regular expressions can be helpful for pattern matching, but they may not cover all possible variations. To enhance accuracy, we can incorporate robust validation techniques or leverage third-party APIs to verify the extracted contact information.
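As a minimal sketch of such validation, the checks below filter out obviously implausible matches before they are stored. The 15-digit upper bound follows the E.164 numbering format; the 10-digit lower bound mirrors the phone pattern used in this article and is an assumption, as are the function names.

```python
import re


def is_plausible_phone(raw):
    """Accept numbers with 10 to 15 digits once separators are removed."""
    digits = re.sub(r"\D", "", raw)
    return 10 <= len(digits) <= 15


def is_plausible_email(raw):
    """Reject addresses with empty parts, doubled dots, or a dotless domain."""
    local, separator, domain = raw.partition("@")
    if not local or not separator or "@" in domain:
        return False
    return "." in domain and ".." not in raw


print(is_plausible_phone("123-456-7890"), is_plausible_phone("12345"))
print(is_plausible_email("john.doe@example.com"), is_plausible_email("john..doe@example.com"))
```

These checks only test the shape of the value. Confirming that a number or mailbox actually exists would require an external service.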

Skills Extraction

Identifying skills mentioned in the resume accurately is a challenge due to the vast array of possible skills and their variations. Using a predefined list of skills or employing techniques like keyword matching or natural language processing can aid in extracting skills effectively. However, it’s crucial to regularly update and refine the skill list to accommodate emerging skills and industry-specific terminology.
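One way to handle such variations is to map several spellings to a single canonical skill. The alias table below is a made-up example, and extract_canonical_skills is a hypothetical helper; a real list would need to be curated for your domain.

```python
import re

# Hypothetical alias table: canonical skill -> spellings seen in resumes
SKILL_ALIASES = {
    "Machine Learning": ["machine learning", "ML"],
    "JavaScript": ["javascript", "JS"],
    "SQL": ["sql", "structured query language"],
}


def extract_canonical_skills(text, aliases=SKILL_ALIASES):
    """Return canonical skill names whose aliases appear as standalone terms."""
    found = []
    for canonical, variants in aliases.items():
        for variant in variants:
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            if re.search(pattern, text, re.IGNORECASE):
                found.append(canonical)
                break  # one matching alias is enough for this skill
    return found


print(extract_canonical_skills("Built ML pipelines and JS dashboards; some html too"))
```

Short aliases such as "ML" are matched case-insensitively here, so they can produce false positives in some texts; tightening the match for abbreviations is a reasonable next step.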

Extracting Education Information

Extracting education details from resumes can be complex as they can be mentioned in various formats, abbreviations, or different orders. Employing a combination of regular expressions, keyword matching, and contextual analysis can help identify education information accurately. It’s essential to consider the limitations of pattern matching and handle variations appropriately.

Handling Multilingual Resumes

Dealing with resumes in different languages adds another layer of complexity. Language detection techniques and language-specific parsing and extraction methods enable the handling of multilingual resumes. However, it’s crucial to ensure language support for the libraries or models used in the parser.

When developing a resume parser, combining techniques like rule-based matching, regular expressions, and natural language processing can enhance information extraction accuracy. We recommend testing and refining the parser by using diverse resume samples to identify and address potential issues. Consider using open-source NLP libraries such as spaCy or NLTK, which offer pre-trained models and components for tokenization, named entity recognition, and other language processing tasks. Remember, building a robust resume parser is an iterative process that improves with user feedback and real-world data.

Conclusion

In conclusion, resume parsing with spaCy saves recruiters time, streamlines the hiring process, and supports more informed decisions. Text extraction, regular expressions, keyword matching, and spaCy’s rule-based pattern matching together retrieve contact details, skills, education, and candidate names. Working through a sample resume shows how these pieces fit together in practice. By implementing a spaCy resume parser, recruiters can make screening more efficient and effective, leading to better hiring outcomes.

Remember that building a resume parser requires a combination of technical skills, domain knowledge, and attention to detail. With the right approach and tools, you can develop a powerful resume parser that automates the extraction of crucial information from resumes, saving time and effort in the recruitment process.

Frequently Asked Questions

Q1. What is resume parsing?

A. Resume parsing is a technology that allows automated extraction and analysis of information from resumes. It involves parsing or breaking down a resume into structured data, enabling recruiters to efficiently process and search through a large number of resumes.

Q2. How does resume parsing work?

A. Resume parsing typically involves using natural language processing (NLP) techniques to extract specific data points from resumes. It utilizes algorithms and rule-based systems to identify and extract information such as contact details, skills, work experience, and education.

Q3. What are the benefits of using resume parsing?

A. Resume parsing offers several benefits, including time-saving for recruiters by automating the extraction of critical information, improved accuracy in capturing data, streamlined candidate screening and matching, and enhanced overall efficiency in the recruitment process.

Q4. What challenges can arise with resume parsing?

A. Some challenges in resume parsing include accurately interpreting and extracting information from resumes with varying formats and layouts, dealing with inconsistencies in how candidates present their information, and handling potential errors or misinterpretations in the parsing process.

Q5. Are there specialized tools or software for resume parsing?

A. Yes, there are various specialized tools and software available for resume parsing. Some popular options include Applicant Tracking Systems (ATS), which often include resume parsing capabilities, and dedicated resume parsing software that can integrate with existing recruitment systems.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
