Scalability is an important problem in epidemiological applications that simulate complex intervention scenarios over large datasets.
Indemics is one such interactive, data-intensive framework for large-scale epidemic simulations based on high-performance computing (HPC).
In the Indemics framework, interventions are supplied from an external, standalone database, which has proved to be an effective way of implementing interventions. Although this setup performs well for simple interventions and small datasets, performance and scalability remain an issue for complex interventions and large datasets.
In this thesis, we present IndemicsXC, a scalable and massively parallel high-performance data engine for Indemics in a supercomputing environment.
IndemicsXC can implement complex interventions over large datasets.
Our distributed database solution retains the simplicity of Indemics by using the same SQL query interface for expressing interventions.
We show that our solution implements the most complex interventions by intelligently offloading them to the supercomputer nodes and processing them in parallel.
We present an extensive performance evaluation of our database engine through various intervention case studies over synthetic population datasets. The evaluation of our parallel and distributed database framework illustrates its scalability advantage over a standalone database.
Our results show that the distributed data engine is a parallel, scalable, and cost-efficient means of implementing interventions.
The cost model proposed in this thesis can be used to approximate intervention query execution time with reasonable accuracy.
Public health officials could leverage our distributed database framework to make fast, accurate, and sensible decisions during an outbreak.
Finally, we discuss the considerations for using distributed databases for driving large-scale simulations.