Cancelled

Extract data from SEC EDGAR website (from Form 10-K; 82,955 .txt links provided)

I’m interested in collecting information about employee unionization from public company annual reports (Form 10-K). Each 10-K contains several standardized sections. The labor union info I’m interested in is located in “Item 1. Business” and “Item 1A. Risk Factors”.

Step 1: Access the links to Form 10-K .txt files (N=82955) in the attached read file and search Item 1 and Item 1A ONLY for the keywords below…

KEYWORDS: collective bargaining, collective-bargaining, CBA, labo(u)r union(s), labo(u)r agreement(s), labo(u)r contract(s), labo(u)r organization(s), union agreement(s), union contract(s), union organization(s), or union(s)

Step 2: If one of the above keywords matches the text in Item 1 or 1A, add the entire sentence (or paragraph, whichever is easier) containing the match to a new field/column in the read file. The output file could have one field for any Item 1 output and a second field for any Item 1A output.

Appendix C and Appendix D of the attached research paper (pg36-37) provide some examples of the union-related text that I’m looking for.

Step 3: (If possible) Create 3 different union-related variables (binary, percentage, number) from the extracted Item 1 sentences/paragraphs. Create a separate set of 3 union-related variables from the Item 1A text. First, identify whether the union-related statement is positive or negative (i.e. "employees are represented/covered by a union" vs. "employees are NOT in a union", "none of our employees are represented") with a binary variable (=1 for (some) union representation and =0 for no representation). Second, extract the percentage of employees covered if available. Third, extract the number of employees covered if available.

I realize this last part is tricky to do mechanically. I’ll have to check this part manually anyway, so any progress here with a reasonable error rate will be appreciated.

I've revised my original description in an effort to make the project goals clearer. Please see the updated description below:

Alright so the goal is to look at a company’s annual report to shareholders (Form 10-K) and identify the following:
1) whether its employees are unionized (0=no, 1=yes)
2) the number of employees represented by a union (if available)
3) the percentage of employees represented by a union (if available)

This labor union info will be located within the text of one of the following standardized sections of Form 10-K: “Item 1. Business” or “Item 1A. Risk Factors”.

STEP 1: Access the 82,955 linked 10-K documents (.txt) in the attached csv file, and search the text in Item 1 and Item 1A ONLY for the keywords below. The parentheses ( ) indicate variations of the word, such as "labor" and "labour".
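One way to restrict the search to Item 1 and Item 1A is to locate the item headings in the filing text and keep only the spans between them. The sketch below is only an illustration under stated assumptions: the heading pattern and the function name split_items are hypothetical, it expects plain text with headings at the start of a line, and it keeps the longest candidate span per item so that short table-of-contents entries are ignored. Real filings vary a lot in formatting, so the output would need spot checks.

```python
import re

# Assumed heading shape: "Item 1", "Item 1A", "Item 1B" or "Item 2" at line start.
# The trailing \b keeps "Item 10" from being read as "Item 1".
ITEM_HEADING = re.compile(
    r"^[ \t]*item[ \t]+(1a|1b|1|2)\b",
    re.IGNORECASE | re.MULTILINE,
)


def split_items(text):
    """Return {'1': ..., '1A': ...} holding the longest span found for each item."""
    hits = [(m.group(1).upper(), m.start()) for m in ITEM_HEADING.finditer(text)]
    sections = {"1": "", "1A": ""}
    for i, (label, start) in enumerate(hits):
        if label not in sections:
            continue  # Item 1B and Item 2 only serve as end markers
        # A section runs until the next heading that carries a different label.
        end = len(text)
        for next_label, next_start in hits[i + 1:]:
            if next_label != label:
                end = next_start
                break
        body = text[start:end]
        # Table-of-contents entries are short, so keep the longest span seen.
        if len(body) > len(sections[label]):
            sections[label] = body
    return sections
```

When downloading the .txt files themselves, note that SEC EDGAR asks automated clients to send a descriptive User-Agent header and to keep the request rate low.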

KEYWORDS: collective bargaining, collective-bargaining, CBA, labo(u)r union(s), labo(u)r agreement(s), labo(u)r contract(s), labo(u)r organization(s), union agreement(s), union contract(s), union organization(s), or union(s)
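The keyword list above can be folded into a single case-insensitive regular expression. This is a minimal sketch, not a definitive pattern: the names KEYWORD_RE and find_keywords are hypothetical, and accepting the "organisation" spelling and line breaks between the words of a phrase are assumptions that go slightly beyond the list.

```python
import re

# One alternation per keyword family; \s+ tolerates line breaks inside a phrase.
KEYWORD_RE = re.compile(
    r"\b(?:"
    r"collective[-\s]+bargaining"
    r"|CBA"
    r"|labou?r\s+(?:unions?|agreements?|contracts?|organi[sz]ations?)"
    r"|union\s+(?:agreements?|contracts?|organi[sz]ations?)"
    r"|unions?"
    r")\b",
    re.IGNORECASE,
)


def find_keywords(text):
    """Return every keyword hit in the order it appears in the text."""
    return [m.group(0) for m in KEYWORD_RE.finditer(text)]
```

The word boundaries keep "reunion" and "unionized" from matching, but a bare "union" will still hit unrelated phrases such as "credit union" or "European Union", so some manual filtering is likely to be needed.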

STEP 2: For each match between one of the keywords and the 10-K text, extract the sentence (or paragraph, whichever is easier) containing the matching keyword. Do this separately for Item 1 and Item 1A. If there are multiple matches leading to the same sentence, you can delete them. I only need a record of unique (different) sentences containing the union-related keywords above.
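The sentence extraction and de-duplication described in Step 2 could be sketched as follows. Everything here is an assumption for illustration: the sentence splitter is deliberately naive (it breaks after ".", "!" or "?" when followed by whitespace and a capital letter, digit or quote, so abbreviations like "Inc." will cause wrong splits), KEYWORDS is a compact stand-in for the full keyword list, and unique_keyword_sentences is a hypothetical name.

```python
import re

# Compact stand-in for the keyword list; swap in the full pattern as needed.
KEYWORDS = re.compile(
    r"\b(?:collective[-\s]+bargaining|CBA|"
    r"labou?r\s+(?:agreements?|contracts?|organizations?)|unions?)\b",
    re.IGNORECASE,
)

# Naive sentence boundary: end punctuation, whitespace, then a likely sentence start.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"“])")


def unique_keyword_sentences(section_text, pattern=KEYWORDS):
    """Return keyword-bearing sentences in document order, without repeats."""
    seen = set()
    out = []
    for raw in SENTENCE_BREAK.split(section_text):
        sentence = " ".join(raw.split())  # collapse line breaks and extra spaces
        key = sentence.lower()            # compare case-insensitively
        if sentence and pattern.search(sentence) and key not in seen:
            seen.add(key)
            out.append(sentence)
    return out
```

Because a sentence is recorded once no matter how many keywords it contains, multiple matches leading to the same sentence collapse automatically.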

Appendix C and Appendix D of the attached research paper (pg36-37) provide some examples of the union-related text that I’m looking for.

STEP 3: Parse the extracted text to create the three output variables listed above (binary 0/1, employee count, employee percentage). Again, do this separately for the Item 1 and Item 1A text. To classify a company as union or non-union, you’ll have to look at the context and whether it’s positive or negative (i.e. "employees are represented by a union" vs. "employees are NOT in a union", "none of our employees are represented"). Some suggestions: for the employee count, extracting the number immediately preceding the word “employees” would be a good start. Similarly, for the percentage, extract the number immediately preceding “%” or “percent”.
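A rough heuristic for Step 3, following the suggestions above, might look like this. It is a sketch under several assumptions, not a reliable classifier: the negation word list, the allowance of up to three words between the number and "employees", the choice to return None when no sentences were found, and the name parse_union_variables are all hypothetical. It will, for example, pick up a total headcount if that number appears before the union headcount, so the results still need manual review.

```python
import re

# A negation word followed, within the same sentence, by a union-related term.
NEGATION = re.compile(
    r"\b(?:no|none|not|neither|never)\b[^.]*?"
    r"\b(?:unions?|collective[-\s]+bargaining|represented|unionized)\b",
    re.IGNORECASE,
)
# Number before "employees", allowing a few words such as "of our" in between.
COUNT = re.compile(r"(\d[\d,]*)\s+(?:[\w-]+\s+){0,3}?employees\b", re.IGNORECASE)
# Number directly before "%" or "percent".
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent\b)", re.IGNORECASE)


def parse_union_variables(sentences):
    """Return {'union': 0/1/None, 'count': int/None, 'percent': float/None}."""
    if not sentences:
        return {"union": None, "count": None, "percent": None}
    # Sentences without a negation are treated as evidence of representation.
    positive = [s for s in sentences if not NEGATION.search(s)]
    result = {"union": 1 if positive else 0, "count": None, "percent": None}
    for s in positive:
        if result["count"] is None:
            m = COUNT.search(s)
            if m:
                result["count"] = int(m.group(1).replace(",", ""))
        if result["percent"] is None:
            m = PERCENT.search(s)
            if m:
                result["percent"] = float(m.group(1))
    return result
```

Running the function once on the Item 1 sentences and once on the Item 1A sentences gives the two separate sets of variables.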

The output file (in .csv, .xls, or .xlsx) should contain the following items:
1) the identifier from the input file
2a) all unique sentences from Item 1 with any of the keywords in them (multiple sentences can be combined into one cell or field so that each identifier and link has only 1 row of output)
2b) binary union variable (0=no, 1=yes) for the Item 1 text
2c) Number of union employees from the Item 1 text
2d) Percentage of union employees from the Item 1 text
3a) all unique sentences from Item 1A with any of the keywords in them (multiple sentences can be combined)
3b) binary union variable (0=no, 1=yes) for the Item 1A text
3c) Number of union employees from the Item 1A text
3d) Percentage of union employees from the Item 1A text
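The output layout listed above (one row per identifier, with separate Item 1 and Item 1A fields) could be written with Python's csv module roughly as follows. The column names, the " | " separator used to combine multiple sentences into one cell, and the helper names build_row and write_rows are assumptions for illustration only.

```python
import csv
import io

# Hypothetical column names mirroring items 1), 2a)-2d) and 3a)-3d).
FIELDS = [
    "identifier",
    "item1_sentences", "item1_union", "item1_count", "item1_percent",
    "item1a_sentences", "item1a_union", "item1a_count", "item1a_percent",
]


def build_row(identifier, item1, item1a):
    """item1 and item1a are dicts with keys 'sentences', 'union', 'count', 'percent'."""
    row = {"identifier": identifier}
    for prefix, data in (("item1", item1), ("item1a", item1a)):
        # Combine all unique sentences into a single cell.
        row[f"{prefix}_sentences"] = " | ".join(data["sentences"])
        for key in ("union", "count", "percent"):
            value = data[key]
            row[f"{prefix}_{key}"] = "" if value is None else value
    return row


def write_rows(rows, handle):
    """Write a header plus one CSV line per filing to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

For a real run the handle would be a file opened with newline="" so the csv module controls line endings; missing counts or percentages are left as empty cells.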

Skills: Data Mining, Web Scraping


About the employer:
(0 reviews) State College, United States

Project ID: #17178642

11 freelancers are bidding an average of $191 on this job

zekovicm

Hi there, I am Miljan, a web scraping expert from Bosnia & Herzegovina, Europe. I have carefully gone through your requirements and I would like to help you with this job! I can start immediately and finish it within …

$155 USD in 2 days
(48 reviews)
6.3
SBITServices

Hello JRESEARCH, I have checked the data in the given links and I can write a macro script to get data from these files. I have 5 years of experience in Excel VBA programming and can complete the task within …

$200 USD in 3 days
(83 reviews)
6.3
$166 USD in 3 days
(3 reviews)
5.6
zhangyingtai

Hello, I am a qualified Python developer with 8 years of professional experience. In particular, I have extensive experience scraping the SEC EDGAR website. I got you what you want and I am able to finish the project i…

$250 USD in 3 days
(14 reviews)
5.5
$100 USD in 3 days
(12 reviews)
3.9
MSNEnterprise

Hello there, I hope you are doing well, and thank you for reviewing our proposal. We reviewed the job requirements thoroughly and would like to assist by offering our services related to scraping and data extraction. …

$250 USD in 5 days
(3 reviews)
2.1
$150 USD in 7 days
(2 reviews)
2.0
harshal109

Let's discuss in chat.

$250 USD in 7 days
(0 reviews)
0.0
shyamalaam

Dear Employer, I have gone through your requirements and I would like to apply for your data mining job. I have similar previous work experience and can finish the work on time and within budget. Happy to hear from yo…

$155 USD in 3 days
(0 reviews)
0.0
ricardopdmcruz

I am a machine learning researcher. I have already done scraping on Facebook (mobile version) and other websites. One job I did was to scrape data from [login to view URL] I usually keep it simple and …

$222 USD in 14 days
(0 reviews)
0.0
$200 USD in 7 days
(0 reviews)
0.0