What would be the best practice in terms of AWS for the following:
- Many IoT medical devices gather data at around 256 kB/s
- The data is time-series data (a matrix of [Channels x Samples]; there can be millions of samples and dozens of channels)
- Data is saved into files in S3 and each session is logged in a database with some metadata. So far we are using RDS for this.
- Each dataset is around 5GB
- We have access to the datasets and would like to run an analysis flow:
  - Access the data file
  - Analysis step:
    - Execute version-managed code that accepts the data file and produces a result (another file or a JSON); see the sketch after this list
    - Register the analysis step in some database (which one?) and register the result (if a file is produced, register its location)
  - Perform N more analysis steps in a similar manner. Analysis steps can depend on each other, but can also run in parallel.
  - The result of the Nth step is the end result of the analysis flow.
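For illustration, here is a minimal sketch of what a single analysis step could look like if it runs as Python code with S3 access. The bucket/key names and the `run_step` function are hypothetical, and the "register the step" part is deliberately left as a returned metadata dict, since which database to write it to is exactly the open question:

```python
import json

import boto3

s3 = boto3.client("s3")


def run_step(input_bucket: str, input_key: str,
             output_bucket: str, output_key: str) -> dict:
    """Fetch a dataset from S3, compute a result, write it back to S3,
    and return the metadata a step registry would need to store."""
    obj = s3.get_object(Bucket=input_bucket, Key=input_key)
    raw = obj["Body"].read()  # for ~5 GB files, prefer streaming or ranged reads

    result = {"n_bytes": len(raw)}  # placeholder for the real analysis

    s3.put_object(Bucket=output_bucket, Key=output_key,
                  Body=json.dumps(result).encode())
    return {
        "step": "example_step",
        "input": f"s3://{input_bucket}/{input_key}",
        "output": f"s3://{output_bucket}/{output_key}",
    }
```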
The idea is to provide an easy way to run code on the data in AWS without actually downloading the files, while keeping a log of which analyses were performed on which data.
Any ideas which services and databases to use? How to pass the data around?
What would be an easy-to-use interface for a data scientist who works with Python, for example?
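To make the "Python interface" question concrete, here is one possible shape for it: a thin client that a data scientist imports, which submits a flow definition and polls until it finishes. All names here (`FlowClient`, `submit`, `wait`, the `backend` object) are hypothetical; the real calls would depend on whatever orchestration service ends up behind it.

```python
import time


class FlowClient:
    """Thin wrapper a data scientist could import; ``backend`` is whatever
    actually submits flows (a REST API client, boto3, an internal service)."""

    def __init__(self, backend):
        self.backend = backend

    def submit(self, flow_definition: dict, dataset_uri: str) -> str:
        """Submit a flow against a dataset in S3 and return an execution id."""
        return self.backend.start_execution(flow_definition, dataset_uri)

    def wait(self, execution_id: str, poll_seconds: int = 30) -> dict:
        """Block until the flow finishes and return the final status record."""
        while True:
            status = self.backend.get_status(execution_id)
            if status["state"] in ("SUCCEEDED", "FAILED"):
                return status
            time.sleep(poll_seconds)


# usage from a notebook (backend object and URI are made up):
# client = FlowClient(backend=my_backend)
# run_id = client.submit(flow, "s3://my-bucket/sessions/session-1234.dat")
# final = client.wait(run_id)
```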
I have the following idea in mind:
- Analysis steps are version-managed code repos in CodeCommit (they can be packaged as containers)
- Data scientists define flows in JSON format (see the example after this list)
- When a data scientist triggers a run, their flow is executed
- The flow is registered as an entry in a database
- A flow manager distributes the flows between execution agents
- An agent is a mechanism that gets the flow, pulls the data and containers, and executes the flow
- Each agent registers each step of the flow in a database
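For example, a flow definition could look roughly like this (expressed here as a Python dict that would be serialized to JSON; all field names, container image names, and the S3 URI are made up for illustration):

```python
import json

# Hypothetical flow definition: each step references a container image and
# declares its dependencies, so independent steps can run in parallel.
flow = {
    "flow_name": "artifact_pipeline_v1",
    "dataset": "s3://my-bucket/sessions/session-1234.dat",
    "steps": [
        {"id": "filter", "image": "repo/filtering:1.2", "depends_on": []},
        {"id": "label",  "image": "repo/labelling:0.9", "depends_on": ["filter"]},
        {"id": "stats",  "image": "repo/statistics:2.0", "depends_on": ["filter"]},
        {"id": "report", "image": "repo/report:1.0", "depends_on": ["label", "stats"]},
    ],
}

# What would actually be submitted to the flow manager:
print(json.dumps(flow, indent=2))
```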
Examples of analysis steps:
- Filtering (sketched below)
- Labelling of artifacts in the data (timestamps)
- Calculation of statistical parameters
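As a concrete example of the first kind of step, filtering a [Channels x Samples] matrix might look roughly like this inside its container (assuming numpy/scipy are available; the cutoff frequencies and sampling rate are made-up example values):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(data: np.ndarray, fs: float,
             low: float = 0.5, high: float = 40.0) -> np.ndarray:
    """Zero-phase band-pass filter applied independently to each channel.

    ``data`` has shape (channels, samples); ``fs`` is the sampling rate in Hz.
    """
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=1)  # axis=1 is the samples axis


# Example: 32 channels, 10 seconds at 1 kHz
example = np.random.randn(32, 10_000)
filtered = bandpass(example, fs=1000.0)
```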