Endpoint for text data from Budget In Brief - Services Areas #17

Open
MikeTheCanuck opened this issue Feb 20, 2017 · 0 comments


MikeTheCanuck commented Feb 20, 2017

We have scraped the tabular data for Service Areas from the past two years' Budget in Brief documents. You can see it in the Data folder in this repo (https://github.com/hackoregon/team-budget/tree/master/Data).

The next step is to scrape the text data on those Service Area pages so that we can emit it from an API endpoint and render it inline alongside the OCRB and KPM data. The data includes:

  • intro text just below the Service Area heading (e.g. under Parks, Recreation & Culture Service Area, the intro text is "The Parks, Recreation & Culture service area includes services for Portland Parks & Recreation, the only bureau in this service area. The bureau also administers the Golf program and Portland International Raceway.")
  • Significant Issues and Major Projects (SIMP) - this section falls between the OCRB table and the KPM table, and includes one or more bullets of text
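To make the two kinds of text concrete, here is a minimal sketch of one possible record shape, written in plain Python rather than as a Django model. The class name `ServiceAreaText` and all of its field names are assumptions made for illustration; nothing in this repo defines them yet. The bullet strings are placeholders, not text from the Budget in Brief.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceAreaText:
    """Hypothetical record: the text for one Service Area in one fiscal year."""
    fiscal_year: str      # assumed format, e.g. "2016-17"
    service_area: str     # heading as printed in the Budget in Brief
    intro_text: str       # paragraph just below the Service Area heading
    # SIMP bullets, stored in the order they appear in the document
    simp_bullets: List[str] = field(default_factory=list)


example = ServiceAreaText(
    fiscal_year="2016-17",
    service_area="Parks, Recreation & Culture",
    intro_text=(
        "The Parks, Recreation & Culture service area includes services for "
        "Portland Parks & Recreation, the only bureau in this service area. "
        "The bureau also administers the Golf program and Portland "
        "International Raceway."
    ),
    simp_bullets=["Placeholder bullet 1", "Placeholder bullet 2"],
)
```

Keeping the bullets as a list, rather than one long string, would leave the rendering decision to whoever consumes the endpoint.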

Acceptance Criteria

  • Assumption: the SIMP data must be emitted from the API in a format that ensures that bullets will render in the same order as in the Budget in Brief document.
  • The data probably varies from year to year, so API users must be able to request the text for a specified fiscal year (e.g. user requests all SIMP text from FY2016-17)
  • The data must be captured so that importing it into a database and reading it back from Django code does not noticeably change how readable the text is for the end user, compared to reading the Budget in Brief PDF. (e.g. if all bullets for a single Service Area are stored as a single record, then the bullet characters must be encoded so that they automatically show up as bullets in the user's browser)
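The criteria above could be met in several ways. One rough sketch, assuming each bullet is stored as its own row with an explicit `position` column, is below. The row layout, the `position` field, and the function names `simp_for_year` and `bullets_to_html` are all hypothetical, and the bullet texts are placeholders. It uses plain Python so it can be tried without a Django project.

```python
import html

# Hypothetical rows, one per SIMP bullet. "position" records document order.
RECORDS = [
    {"fiscal_year": "2016-17", "service_area": "Parks, Recreation & Culture",
     "position": 2, "text": "Placeholder bullet B"},
    {"fiscal_year": "2016-17", "service_area": "Parks, Recreation & Culture",
     "position": 1, "text": "Placeholder bullet A"},
    {"fiscal_year": "2015-16", "service_area": "Parks, Recreation & Culture",
     "position": 1, "text": "Placeholder bullet from an earlier year"},
]


def simp_for_year(records, fiscal_year):
    """Return {service_area: [bullet text, ...]} for one fiscal year,
    with bullets sorted by their stored position."""
    selected = [r for r in records if r["fiscal_year"] == fiscal_year]
    selected.sort(key=lambda r: (r["service_area"], r["position"]))
    grouped = {}
    for row in selected:
        grouped.setdefault(row["service_area"], []).append(row["text"])
    return grouped


def bullets_to_html(bullets):
    """Render bullets as an HTML list so a browser shows real bullet markers.
    Text is escaped so characters such as '&' display literally."""
    items = "".join(f"<li>{html.escape(b)}</li>" for b in bullets)
    return f"<ul>{items}</ul>"
```

Storing an explicit position means the order survives a database round trip regardless of how rows are returned, and emitting a list (or `<ul>` markup) avoids depending on a literal bullet character surviving encoding.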

Any tool that works will do. The tool used to scrape the tabular data was Tabula; it is not yet known whether it would also work for text data, or whether a simple cut-and-paste would work well enough.
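If cut-and-paste turns out to be good enough, the pasted text would still need to be split into individual bullets. The sketch below is one way to do that, under the assumption that each bullet in the pasted text starts with the "•" character and that line breaks inside a bullet are only PDF wrapping. The actual character produced by copying from the PDF has not been checked, and `split_pasted_bullets` is a hypothetical helper name.

```python
import re


def split_pasted_bullets(pasted: str):
    """Split pasted text into a list of bullets, in their original order.

    Assumes every bullet begins with "•". Anything before the first "•"
    is ignored, and wrapped lines within a bullet are joined with spaces.
    """
    bullets = []
    # Drop the chunk before the first bullet marker (empty or a preamble).
    for chunk in pasted.split("•")[1:]:
        text = re.sub(r"\s+", " ", chunk).strip()
        if text:
            bullets.append(text)
    return bullets


sample = "• First placeholder item that wraps\n  onto a second line\n• Second placeholder item\n"
bullets = split_pasted_bullets(sample)
```

Because the list keeps the order of the pasted text, it can be stored directly with a position index for each bullet.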

Question for City Budget Office contacts: must the SIMP bullets be displayed every time in the same order as they are presented in the Budget in Brief PDF documents?

MikeTheCanuck changed the title from "Capture text data from Budget In Brief - Services Areas" to "Endpoint for text data from Budget In Brief - Services Areas" on Feb 21, 2017