Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

12: Improved Entity Matching #42

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

12: Improved Entity Matching #42

wants to merge 7 commits into from

Conversation

NewAgeAirbender
Copy link

Outlines some of the pain points we've run into with entity matching not being as strong as it could be in OS Core

Copy link

@jessemortenson jessemortenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read and added some suggestions.

One thing that is tricky is that I don't know that many folks on the team have the Open States scrape->import data flow fully in mind as a mental model. I think it would help to include at least one example that illustrates where matching happens in the data flow. (or point to documentation, but I don't think it's documented). Something like:

  • Bill scraper calls add_sponsorship() passing in { "name:" "JOHNSON" }
  • X() calls resolve_person() passing in a pseudo_person_id that is JOHNSON with start/end date values from Y and org_classification from Z
  • resolve_person() constructs a spec that is used to compose filters to query data from the Person model to find a match

012-improve-entity-matching.md Outdated Show resolved Hide resolved
012-improve-entity-matching.md Show resolved Hide resolved
012-improve-entity-matching.md Outdated Show resolved Hide resolved
012-improve-entity-matching.md Outdated Show resolved Hide resolved
012-improve-entity-matching.md Show resolved Hide resolved
012-improve-entity-matching.md Show resolved Hide resolved
012-improve-entity-matching.md Outdated Show resolved Hide resolved
012-improve-entity-matching.md Show resolved Hide resolved
012-improve-entity-matching.md Outdated Show resolved Hide resolved

## Drawbacks

Should absolutely add defaults if we're not certain what's going to be passed in on `core` updates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?

Copy link

@alexobaseki alexobaseki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff so far but I feel there is still a few clean up to done here. But I recommend we go ahead and create the individual tickets before to circle back to completing the clean up here, this might also help with structure our objectives on the high level better.

jurisdictions have sponsors from both chamber per Bill
- Scrapers: Ensure correct `chamber` is passed in with `add_sponsorship` on Bill Scrapes
- People Script: Update People Script to include name values that may be overwritten as `other_name` options
- People Repo: Add `other_name` values that match scraped name formats for sponsorship or votes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we intend to do this? Maybe using the people matching tool? Explaining how we will arrive at this will be useful.

- if name is set, match on (the rest of the spec) AND (first other_names value matches name) OR (name is exact match)
- if name is NOT set, then just match on rest of spec

IF we go the `other_name` route, the change we'd need to make is:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets keep other names consistent across board. I see it is other_names in code

being correctly passed in as the `entity_type` in `add_sponsorship()`.
Currently, the `limit_spec` function is used to overwrite the Django default to limit the query parameters. As of right
now, the function:
- If classification is NOT party, then add the jurisdiction_id to the query spec

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This steps is not terrible clear to me. Like what is "Django default", which classification is NOT party, which entity's jurisdiction_id are we adding to the query spec? A little explanation or link to where this changes will be happening would be helpful.


IF we wanted to split up by chamber & type first in `core`, we'd have to add:
- Update [add_participant](https://github.com/openstates/openstates-core/blob/7ac7b73bbb0956f7a539128f9186929509c19550/openstates/scrape/event.py#L140)
and `add_committee` to accept a `chamber` value or `committee_type` of `committee` or `subcommittee` (if `subcommittee`,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a link to add_committee or this is going to be a new function?

add `parent_committee_id`)
- Add that `chamber` value to the `self.org_importer.resolve_json_id` calls in the `EventImporter` on lines [92](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/events.py#L92)
and 101
- In `limit_scope` if classification is `committee`, then add the `chamber_id` to query spec

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me for my ignorance, is this limit_scope or limit_spec.

#### Solutions:
- Core: Fix `limit_spec` on the `OrganizationImporter` so that more than just the first string in `other_names` is checked for
Committees
- People Script: Update Committee Script to include `other_names` for Committees that include Chamber, Type, & Both

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming chamber is like House, Senate, Joint. What is Type and Both?

constructs a spec that is used to compose filters to query data from the Person model to find a match. Could pass in
`org_classification` but currently don't to narrow down via chamber
- If jurisdiction has more than one legislator with the last name "Johnson", Importer will give an error message that
`multiple people returned for spec` but continue through Import task

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am imagining that "multiple people returned for spec" will be limited if you have organization classification in the resolve person query.

Thinking of an idea here:
If there are can we have these type of mismatch save somewhere in the database or a file that can be used in matching tool. I envision a matching tool that is broader than just the unmatched name and list of all the person. Have an idea of the multiple object returned can help quickly resolve the mismatch if and when they happen. This might be a good way to also gather data of mismatch. Some like

unknown_entity | possible_matches | is_resolved |
JOHNSON.         | JOHNSON, Bill.      | False.          |
JOHNSON.         | JOHNSON, Mike.   | False.          |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants