Closed

CANNOT SCALE BIG DATA PROCESSING

Job Description:

I built an ETL pipeline to process terabytes of data. To do that, I set up a Spark cluster (Scala) and a MinIO server for object storage.

I can process and save 200 gigabytes in roughly 30 minutes using 10 virtual machines for Spark processing.

The issue is that I cannot scale that processing: if I double the number of Spark virtual machines, the processing time does not change.

I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.

ARCHITECTURE SUMMARY.

• I use virtual machines set up on-premises using VMware ESXi 6

• Physical machines (which host the VMs) are on a 1 GB network.

• There is no overcommitment of vCPU or RAM

• Spark VMs: 16 vCPU, 64 GB RAM

• MinIO (storage): 16 vCPU, 64 GB RAM, configured with RAID 0
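As a rough sanity check on the figures above (not a diagnosis), the sketch below converts "200 gigabytes in roughly 30 minutes" into an average throughput. It assumes decimal units and counts the 200 GB only once, ignoring shuffle and intermediate writes. Comparing the result with the network also assumes that the "1 GB network" means 1 Gbit/s Ethernet, which is not confirmed here. The object and function names are hypothetical.

```scala
object ThroughputCheck {
  // Average throughput in megabytes per second (decimal units: 1 GB = 1000 MB).
  def megabytesPerSecond(gigabytes: Double, minutes: Double): Double =
    gigabytes * 1000.0 / (minutes * 60.0)

  // Convert megabytes per second to gigabits per second (8 bits per byte).
  def gigabitsPerSecond(mbPerSecond: Double): Double =
    mbPerSecond * 8.0 / 1000.0

  def main(args: Array[String]): Unit = {
    // Figures taken from the job description: 200 GB in about 30 minutes.
    val mbPerSecond = megabytesPerSecond(200.0, 30.0)
    println(s"average throughput: $mbPerSecond MB/s")
    println(s"equivalent line rate: ${gigabitsPerSecond(mbPerSecond)} Gbit/s")
  }
}
```

With these inputs the average comes out at about 111 MB/s, or roughly 0.89 Gbit/s. Under the assumptions above, that is close to what a single 1 Gbit/s link can carry, so the link in front of the MinIO host may be worth measuring before adding more Spark VMs.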

SOME DETAILS ABOUT DATA PROCESSING

The process is straightforward.

• Read data from 2 sources on MinIO,

• Union the data from the two sources,

• Filter out empty values in one column of the resulting dataset,

• Apply two groupBy operations on that column (we save intermediate values after the first groupBy),

• Union the dataset obtained after the groupBy operations with the rows that had empty values in that column,

• Save the whole result back to MinIO.
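The steps above can be sketched in Spark (Scala) roughly as follows. This is only an illustration under stated assumptions, not the actual job: the column names `key`, `sub` and `value`, the `sum` aggregation, the Parquet format and the `s3a://example-bucket/...` paths are all hypothetical, and the S3A endpoint and credentials for MinIO are assumed to be configured elsewhere.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object EtlSketch {
  // Hypothetical columns: "key" may be empty, "sub" is a finer grouping
  // column, "value" is a numeric measure.
  def transform(a: DataFrame, b: DataFrame): (DataFrame, DataFrame) = {
    // Union the two sources by column name.
    val all = a.unionByName(b)

    // Split rows with an empty (null or "") key from the rest.
    val keyIsEmpty = col("key").isNull || col("key") === ""
    val empties = all.filter(keyIsEmpty).select("key", "value")
    val nonEmpty = all.filter(!keyIsEmpty)

    // First groupBy: this is the intermediate result that gets saved.
    val first = nonEmpty.groupBy("key", "sub").agg(sum("value").as("value"))
    // Second groupBy on the same column.
    val second = first.groupBy("key").agg(sum("value").as("value"))

    // Add the empty-key rows back to the aggregated result.
    (first, second.unionByName(empties))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

    // Hypothetical bucket paths on MinIO, read through the S3A connector.
    val a = spark.read.parquet("s3a://example-bucket/source-a/")
    val b = spark.read.parquet("s3a://example-bucket/source-b/")

    val (first, result) = transform(a, b)
    first.write.mode("overwrite").parquet("s3a://example-bucket/intermediate/")
    result.write.mode("overwrite").parquet("s3a://example-bucket/output/")
    spark.stop()
  }
}
```

In this shape the job reads from MinIO twice and writes to it twice, and each groupBy adds a shuffle between the Spark VMs, so both the storage link and the shuffle traffic are candidates to measure.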

Skills: VMware, Spark, Data Engineer, Amazon S3, Big Data

About the client:
(5 reviews) SAINT DENIS, France

Project ID: #35893478

5 freelancers are bidding an average of €334 per hour for this job

ITMed

Hi there, I am excited to share my expertise and skills in data engineering and Big Data, which I have acquired over the past 3 years. I am confident that I can meet your requirements. I would be delighted to work with ...

€140 EUR in 5 days
(1 review)
0.7
rashidamjad

Hi there, how are you? I have gone through your project details. I would like to tell you that I have a great deal of experience in VMware, Spark, Data Engineering, Big Data and Amazon S3. For that I would require from ...

€250 EUR in 8 days
(0 reviews)
0.0
priyanshusing269

Hi Saint Denis, I am a Data Engineer with 7+ years of experience. I would like to offer you help to fix this issue. Please let me know if we can connect.

€140 EUR in 7 days
(0 reviews)
0.0
singhmithilesh60

Hi, I have 10 years of experience in this. I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you. Thank you for

€140 EUR in 7 days
(1 review)
0.1
happydroid

Hi, I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your pro...

€1000 EUR in 15 days
(0 reviews)
0.0