Fast line extraction from a big file with index and fseek (C / C++)
$30-85 USD
In Progress
Posted almost 13 years ago
$30-85 USD
Paid on delivery
We have very big files (millions of lines), each line in the format "NAME | CODE". We want a program that creates an index of these big files, associating each name with the position of its line, and a second program that loads the index into RAM and prints all lines (NAME | CODE) given a list of names.
## Deliverables
# Two programs for fast line extraction
## Overview
We have very big files (~15 GB) such as [login to view URL] in the "Files involved" section. We need to extract selected lines by name.
Two programs are needed: one that creates an index of a big file, and a second that extracts selected lines from the big file after loading into RAM the index created by the first. Speed is required; we currently use the "SortedSeek" module in a Perl script, but we would prefer to load an index in RAM.
Usage example:
$ createindex -i [login to view URL] -o [login to view URL]
(creates an index containing the position of each name in the "100MLINES" file, see below)
$ cat [login to view URL] | getcode -i [login to view URL] -idx [login to view URL]
(reads a list of names from STDIN, loads the whole index into RAM, then prints the corresponding lines from the [login to view URL] file)
## Files involved
FILE "[login to view URL]" (about 15 GB, each line in the format ">NAME | CODE")
>1000_1000_1002_R3 | G10333100310122330222202213203103002222131100102220
>1000_1000_1009_R3 | G03010332203011130031230101223101331200202121002220
>1000_1000_1089_R3 | G13130003031232203200122031001311132300201300313103
>1000_1000_108_R3 | G32120313001110122020213221333022301212310123332223
>1000_1000_1097_R3 | G10222022213110200222303023212110220021122222011030
>1000_1000_112_R3 | G30013020030232022220132330033202003330332033331003
>1000_1000_1165_R3 | G30000332103221321020032013023202300122222001101000
>1000_1000_116_R3 | G02001313223323211030231221120003222210021130132200
>1000_1000_1269_R3 | G21211113013120131000121301123011111012111131131111
>1000_1000_1292_R3 | G11330031313322212101000003303011121013300112222202
...
FILE "[login to view URL]" (some thousand lines)
1_320_1403_R3
10_340_243_R3
245_3002_13_R3
...
* * *
This broadcast message was sent to all bidders on Tuesday, May 24, 2011 10:36:41 AM:
Dear bidders, many of you requested a test set. I prepared a subset of the "BIG FILE" and a short list of names to be retrieved: [login to view URL] Consider that the real "big" file is >100 million reads long. It has been sorted using $ env LC_COLLATE=C sort file > sortedfile. The subset can be used for small-scale testing. Cheers, Andrea