Completed

Fast line extraction from big file with index and fseek (C / C++)

We have very large files (millions of lines) in which each line has the format "NAME | CODE". We want one program that builds an index of such a file, associating each name with the position of its line, and a second program that loads the index into RAM and, given a list of names, prints the matching lines (NAME | CODE).

## Deliverables

# Two programs for fast line extraction

## Overview

We have very big files (~15 GB), such as [url removed, login to view] in the "Files involved" section. We need to extract selected lines by name.

Two programs are needed: one that creates an index of a big file, and a second that extracts selected lines from the big file after loading into RAM the index created by the first. Speed is required: we currently use the "SortedSeek" module in a Perl script, but we would prefer to load an index in RAM.

Usage example:

$ createindex -i [url removed, login to view] -o [url removed, login to view]

(creates an index containing the position of each code in the "100MLINES" file, see below)
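As a rough illustration of what an indexer like `createindex` might do, the sketch below scans the input once and writes one record per matching line. It is only a sketch under assumptions: the function names `parse_name` and `build_index` and the text index format (NAME, a tab, then the byte offset of the line) are hypothetical and not part of this specification, and it relies on POSIX `getline`.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Extract NAME from a line shaped like ">NAME | CODE".
 * Returns the name length and sets *start, or 0 if the line does not match. */
static size_t parse_name(const char *line, const char **start)
{
    if (line[0] != '>')
        return 0;
    const char *sep = strstr(line, " | ");
    if (sep == NULL)
        return 0;
    *start = line + 1;
    return (size_t)(sep - (line + 1));
}

/* Scan `in` from its beginning and write one "NAME<TAB>OFFSET" record
 * (hypothetical index format) per matching line to `out`.
 * Returns the number of records written. */
long build_index(FILE *in, FILE *out)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    unsigned long long offset = 0;  /* byte offset of the current line */
    long count = 0;

    assert(in != NULL && out != NULL);
    while ((len = getline(&line, &cap, in)) != -1) {
        const char *name;
        size_t n = parse_name(line, &name);
        if (n > 0) {
            fprintf(out, "%.*s\t%llu\n", (int)n, name, offset);
            count++;
        }
        offset += (unsigned long long)len;  /* also counts skipped lines */
    }
    free(line);
    return count;
}
```

Offsets are accumulated from line lengths rather than queried with `ftell`, so they stay correct for files larger than 2 GB even where `long` is 32 bits. This assumes the stream is read from byte 0.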

$ cat [url removed, login to view] | getcode -i [url removed, login to view] -idx [url removed, login to view]

(reads a list of names from STDIN, loads the whole index into RAM, then prints the corresponding lines from the [url removed, login to view] file)
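The lookup side could be sketched as follows: load the index into a sorted in-memory array, binary-search each requested name, then seek into the big file and copy one line. This is a minimal sketch, not the required design. The names `struct entry`, `load_index`, `find_entry`, `print_line_at` and `free_index` are hypothetical, and so is the index format it reads (NAME, a tab, then a decimal byte offset per line). It uses POSIX `getline`, `strdup` and `fseeko`.

```c
#define _POSIX_C_SOURCE 200809L  /* getline(), strdup(), fseeko() */
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t on 32-bit systems */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct entry {
    char *name;
    long long offset;            /* byte offset of the line in the big file */
};

static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const struct entry *)a)->name,
                  ((const struct entry *)b)->name);
}

void free_index(struct entry *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        free(v[i].name);
    free(v);
}

/* Read "NAME<TAB>OFFSET" records into a heap array sorted by name.
 * Returns NULL with *count == 0 if the index is empty or memory runs out. */
struct entry *load_index(FILE *idx, size_t *count)
{
    struct entry *v = NULL;
    size_t n = 0, cap = 0, lcap = 0;
    char *line = NULL;

    assert(idx != NULL && count != NULL);
    while (getline(&line, &lcap, idx) != -1) {
        char *tab = strchr(line, '\t');
        if (tab == NULL)
            continue;                        /* skip malformed records */
        *tab = '\0';
        if (n == cap) {                      /* grow the array geometrically */
            size_t ncap = cap ? cap * 2 : 1024;
            struct entry *nv = realloc(v, ncap * sizeof *nv);
            if (nv == NULL)
                goto fail;
            v = nv;
            cap = ncap;
        }
        v[n].name = strdup(line);
        if (v[n].name == NULL)
            goto fail;
        v[n].offset = strtoll(tab + 1, NULL, 10);
        n++;
    }
    free(line);
    if (n > 1)
        qsort(v, n, sizeof *v, cmp_entry);
    *count = n;
    return v;
fail:
    free(line);
    free_index(v, n);
    *count = 0;
    return NULL;
}

/* Binary search by name; returns NULL when the name is not indexed. */
const struct entry *find_entry(const struct entry *v, size_t n,
                               const char *name)
{
    struct entry key = { (char *)name, 0 };
    return n ? bsearch(&key, v, n, sizeof *v, cmp_entry) : NULL;
}

/* Seek to `offset` in `big` and copy one line to `out`. Returns 0 on success. */
int print_line_at(FILE *big, long long offset, FILE *out)
{
    int c;
    if (fseeko(big, (off_t)offset, SEEK_SET) != 0)
        return -1;
    while ((c = getc(big)) != EOF) {
        putc(c, out);
        if (c == '\n')
            break;
    }
    return 0;
}
```

A sorted array with `bsearch` keeps the sketch short. With more than 100 million names, one `strdup` per entry costs a lot of memory, so a real implementation would probably pack the names into one buffer or use a hash table.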

## Files involved

FILE "[url removed, login to view]" (about 15 GB, each line in the format: >NAME | CODE)

>1000_1000_1002_R3 | G10333100310122330222202213203103002222131100102220

>1000_1000_1009_R3 | G03010332203011130031230101223101331200202121002220

>1000_1000_1089_R3 | G13130003031232203200122031001311132300201300313103

>1000_1000_108_R3 | G32120313001110122020213221333022301212310123332223

>1000_1000_1097_R3 | G10222022213110200222303023212110220021122222011030

>1000_1000_112_R3 | G30013020030232022220132330033202003330332033331003

>1000_1000_1165_R3 | G30000332103221321020032013023202300122222001101000

>1000_1000_116_R3 | G02001313223323211030231221120003222210021130132200

>1000_1000_1269_R3 | G21211113013120131000121301123011111012111131131111

>1000_1000_1292_R3 | G11330031313322212101000003303011121013300112222202

...

FILE "[url removed, login to view]" (some thousand lines)

1_320_1403_R3

10_340_243_R3

245_3002_13_R3

...

* * *

This broadcast message was sent to all bidders on Tuesday May 24, 2011 10:36:41 AM:

Dear bidders,

many of you requested a test set. I prepared a subset of the "BIG FILE" and a short list of names to be retrieved: [url removed, login to view]

Consider that the real "big" file is more than 100 million reads long. It has been sorted using

$ env LC_COLLATE=C sort file > sortedfile

The subset can be used for small-scale testing.

Cheers, Andrea
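Sorting with `LC_COLLATE=C` orders lines by raw byte value, which is the same order `strcmp` gives for ASCII text. In the sample above, `1000_1000_1089_R3` comes before `1000_1000_108_R3` because `9` (0x39) sorts before `_` (0x5F). The sketch below checks that a file really is in that order. The function name `first_unsorted_line` is hypothetical, and the check is an optional extra rather than something the job asks for.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Returns 0 if every line of `in` is >= the previous one in byte order,
 * otherwise the 1-based number of the first line that is out of order. */
long first_unsorted_line(FILE *in)
{
    char *prev = NULL, *cur = NULL;
    size_t pcap = 0, ccap = 0;
    ssize_t len;
    long lineno = 0, bad = 0;

    assert(in != NULL);
    while ((len = getline(&cur, &ccap, in)) != -1) {
        lineno++;
        if (len > 0 && cur[len - 1] == '\n')
            cur[len - 1] = '\0';        /* compare without the newline */
        if (prev != NULL && strcmp(prev, cur) > 0) {
            bad = lineno;
            break;
        }
        /* Swap buffers so `prev` holds the line just read. */
        char *tp = prev; prev = cur; cur = tp;
        size_t tc = pcap; pcap = ccap; ccap = tc;
    }
    free(prev);
    free(cur);
    return bad;
}
```

A check like this matters for any approach that binary-searches the sorted file or an index built from it, because the search must compare names with the same byte-wise rule that was used for sorting.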

Skills: C Programming, Engineering, Linux, Mac OS, Project Management, Script Install, Shell Script, Software Architecture, Software Testing

About the employer:
(1 review) Padova, Italy

Project ID: #3334112

Awarded to:

GS94

See private message.

$26 USD in 5 days
(28 reviews)
4.2

7 freelancers are bidding on average $59 for this job

mastirlaa

See private message.

$59.5 USD in 5 days
(74 reviews)
6.1

matfizvw

See private message.

$72.25 USD in 5 days
(55 reviews)
6.0

cristy88

See private message.

$63.75 USD in 5 days
(12 reviews)
3.4

richardjs

See private message.

$68 USD in 5 days
(5 reviews)
2.8

Anurag7

See private message.

$72.25 USD in 5 days
(4 reviews)
2.7

misebal

See private message.

$51 USD in 5 days
(0 reviews)
0.0