We are developing an algorithm that reacts to certain value changes. Right now we use pure Python
for the calculations, but we would like to start using data analysis libraries such as NumPy and Pandas
for faster computation. To run multiple tests with different variables, we will need to optimize our code/model
with vectorization or with better data preparation in general.
Each data entry in the database is approximately 30 seconds apart from the next.
This means the entries look like this (showing relevant values only):
date: '06/18/2019 20:00:00',
date: '06/18/2019 20:00:30',
date: '06/18/2019 20:01:01',
At the start of our algorithm we fetch all the necessary data from the DB in a single array
and iterate through it, comparing each entry with an emulated date (variable, e.g. 15 days from now) and adding 30 seconds on each loop.
This emulates the calculations as if they were running live on that date.
On each loop we make a series of calculations. A backtest of 2 weeks takes 8 minutes on average,
and we would like to reduce that number as much as possible.
For the algorithm to run correctly we need an array of objects, each containing the following from each entry:
date: (Date object of data gathered),
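As a minimal sketch of the data preparation step, the entries could be loaded once into a pandas DataFrame indexed by date. The sample values below are invented for illustration, and the field names data1 and data2 are assumed from the calculations described in this post.

```python
import pandas as pd

# Hypothetical sample rows: the values are made up, and the field
# names data1/data2 are assumed from the description in this post.
rows = [
    {"date": "06/18/2019 20:00:00", "data1": 1.0, "data2": 2.0},
    {"date": "06/18/2019 20:00:30", "data1": 3.0, "data2": 4.0},
    {"date": "06/18/2019 20:01:01", "data1": 5.0, "data2": 6.0},
]

df = pd.DataFrame(rows)

# Parse the date strings once, up front, instead of on every loop.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y %H:%M:%S")
df = df.set_index("date").sort_index()

# Precompute data1 + data2 as a column so it can be aggregated later.
df["total"] = df["data1"] + df["data2"]
```

With the dates parsed and sorted a single time, later steps can select or group rows by time without re-reading each object.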
With this data we group the entries into periods (variable, example: 10 periods of data), each spanning a
defined timeFrame (variable, example: periods of 15 minutes each). Into each period we insert
all data whose date falls within that timeFrame.
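One way to assign every entry to its period without a Python loop is to round each date down to the start of its time frame. This is only a sketch: it assumes periods are aligned to the clock (20:00, 20:15, ...), which may differ from periods counted back from the emulated date, and the sample dates are made up.

```python
import pandas as pd

# Hypothetical sample dates, roughly following the post's format.
dates = pd.to_datetime([
    "2019-06-18 20:00:00",
    "2019-06-18 20:14:30",
    "2019-06-18 20:15:00",
    "2019-06-18 20:31:01",
])

time_frame = "15min"  # assumed equivalent of the timeFrame variable

# Round every date down to the start of its 15-minute period.
# Entries sharing the same period_start belong to the same period.
period_start = dates.floor(time_frame)
```

The resulting period_start values can then be used as a grouping key.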
For each period we need to calculate the average of data1, data2, and data1 + data2, as well as the highest
value (peak) of each data value, so that each period generates an object like this:
Once we have the averages and peak values of each period, we calculate the collective
averages over all the period results. For example, sum(period[avgData1] for period in periods) divided by the number of periods,
sum(period[avgData2] for period in periods) divided by the number of periods, ...
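The per-period statistics and the collective averages described above could be sketched with a pandas groupby. The sample values are invented, the periods are assumed to be clock-aligned 15-minute bins, and column names other than avgData1 and avgData2 (avgTotal, peakData1, and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: two entries in each of two 15-minute periods.
df = pd.DataFrame(
    {"data1": [1.0, 3.0, 5.0, 7.0], "data2": [2.0, 2.0, 4.0, 8.0]},
    index=pd.to_datetime([
        "2019-06-18 20:00:00", "2019-06-18 20:00:30",
        "2019-06-18 20:15:00", "2019-06-18 20:15:30",
    ]),
)
df["total"] = df["data1"] + df["data2"]

# One row per period: averages and peaks computed in a single pass.
periods = df.groupby(df.index.floor("15min")).agg(
    avgData1=("data1", "mean"),
    avgData2=("data2", "mean"),
    avgTotal=("total", "mean"),
    peakData1=("data1", "max"),
    peakData2=("data2", "max"),
    peakTotal=("total", "max"),
)

# Collective averages: the mean of each per-period column, i.e.
# sum over periods divided by the number of periods.
collective = periods.mean()
```

Here collective is a Series with one value per column, so collective["avgData1"] holds the average of the per-period data1 averages.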
The final result will be an object like this:
Translate this algorithm to NumPy or Pandas and reduce the run time for big data analysis.
We've tried putting all the data of each period in independent NumPy arrays and then calculating the averages,
but the results took longer; maybe we are not using NumPy as intended.
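One possible explanation, offered only as a guess, is that creating many small arrays inside a Python loop costs more than the arithmetic it saves. A sketch of the alternative is to keep all entries in one array per field and compute every period at once. The timestamps (seconds from an arbitrary origin) and values below are made up, and the code assumes no period is empty.

```python
import numpy as np

# Hypothetical data: timestamps in seconds and one value per entry.
t = np.array([0, 30, 61, 900, 930, 1800], dtype=np.int64)
data1 = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])

frame = 15 * 60          # assumed timeFrame of 15 minutes, in seconds
bins = t // frame        # period number of each entry
n = int(bins.max()) + 1  # number of periods covered

# Sum and count per period in one call each, then divide for the average.
# An empty period would give a zero count, which this sketch does not handle.
counts = np.bincount(bins, minlength=n)
sums = np.bincount(bins, weights=data1, minlength=n)
avg = sums / counts

# Peak per period: an unbuffered element-wise maximum into each bin.
peak = np.full(n, -np.inf)
np.maximum.at(peak, bins, data1)
```

The same pattern would be repeated for data2 and data1 + data2; whether it is actually faster on the real data set would need to be measured.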