Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

Pages

Posts

Future Blog Post

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

portfolio

projects

publications

Predictions, role of interventions and effects of a historic national lockdown in India’s response to the COVID-19 pandemic: data science call to arms

Published in Harvard Data Science Review, 2020

Abstract: With only 536 COVID-19 cases and 11 fatalities, India took the historic decision of a 21-day national lockdown on March 25, 2020. The lockdown was first extended to May 3 soon after the analysis of this article was completed, and then to May 18 while this article was being revised. In this article, we use a Bayesian extension of the susceptible-infected-removed (eSIR) model designed for intervention forecasting to study the short- and long-term impact of an initial 21-day lockdown on the total number of COVID-19 infections in India compared to other, less severe nonpharmaceutical interventions. We compare effects of hypothetical durations of lockdown on reducing the number of active and new infections. We find that the lockdown, if implemented correctly, can reduce the total number of cases in the short term, and buy India invaluable time to prepare its health care and disease-monitoring system. Our analysis shows we need to have some measures of suppression in place after the lockdown for increased benefit (as measured by reduction in the number of cases). A longer lockdown from 42–56 days is preferable to substantially ‘flatten the curve’ when compared to 21–28 days of lockdown. Our models focus solely on projecting the number of COVID-19 infections and thus inform policymakers about one aspect of this multifaceted decision-making problem. We conclude with a discussion on the pivotal role of increased testing, reliable and transparent data, proper uncertainty quantification, accurate interpretation of forecasting models, reproducible data science methods, and tools that can enable data-driven policymaking during a pandemic. Our software products are available at covind19.org.

Recommended citation: Ray, D., Salvatore, M., Bhattacharyya, R., Wang, L., Du, J., Mohammed, S., … Mukherjee, B. (2020). Predictions, Role of Interventions and Effects of a Historic National Lockdown in India’s Response to the COVID-19 Pandemic: Data Science Call to Arms. Harvard Data Science Review. https://doi.org/10.1162/99608f92.60e08ed5 https://hdsr.mitpress.mit.edu/pub/r1qq01kw/release/6

Are women leaders significantly better at controlling the contagion during the COVID-19 pandemic?

Published in Journal of Health and Social Sciences, 2020

Abstract: Recent media articles have suggested that women-led countries are doing better in terms of their responses to the COVID-19 pandemic. We examine an ensemble of public health metrics to assess the control of the COVID-19 epidemic in women- versus men-led countries worldwide based on data available up to June 3. The medians of the distribution of median time-varying effective reproduction number for women- and men-led countries were 0.89 and 1.14, respectively, with the 95% two-sample bootstrap-based confidence interval for the difference (women – men) being [−0.34, 0.02]. In terms of scale of testing, the median percentages of population tested were 3.28% (women) and 1.59% (men) [95% CI: (−1.29%, 3.60%)], with test positive rates of 2.69% (women) and 4.94% (men), respectively. It appears that, though not statistically significant, countries led by women have an edge over countries led by men in terms of public health metrics for controlling the spread of the COVID-19 pandemic worldwide.

Recommended citation: Purkayastha S, Salvatore M, Mukherjee B. Are women leaders significantly better at controlling the contagion during the COVID-19 pandemic? J Health Soc Sci. 2020 Jun;5(2):231-240. PMID: 32875269; PMCID: PMC7457824. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7457824/

A Review of Multi-Compartment Infectious Disease Models

Published in International Statistical Review, 2020

Abstract: Multi-compartment models have been playing a central role in modelling infectious disease dynamics since the early 20th century. They are a class of mathematical models widely used for describing the mechanism of an evolving epidemic. Integrated with certain sampling schemes, such mechanistic models can be applied to analyse public health surveillance data, such as assessing the effectiveness of preventive measures (e.g. social distancing and quarantine) and forecasting disease spread patterns. This review begins with a nationwide macromechanistic model and related statistical analyses, including model specification, estimation, inference and prediction. Then, it presents a community-level micromodel that enables high-resolution analyses of regional surveillance data to provide current and future risk information useful for local government and residents to make decisions on reopenings of local business and personal travels. R software and scripts are provided whenever appropriate to illustrate the numerical detail of algorithms and calculations. The coronavirus disease 2019 pandemic surveillance data from the state of Michigan are used for the illustration throughout this paper.

Recommended citation: Tang, L., Zhou, Y., Wang, L., Purkayastha, S., Zhang, L., He, J., Wang, F., Song, P.X. ‐K., 2020. A Review of Multi‐Compartment Infectious Disease Models. International Statistical Review. https://doi.org/10.1111/insr.12402 https://onlinelibrary.wiley.com/doi/full/10.1111/insr.12402

A Spatiotemporal Epidemiological Prediction Model to Inform County-Level COVID-19 Risk in the United States

Published in Harvard Data Science Review, 2020

Abstract: As the COVID-19 pandemic continues worsening in the United States, it is of critical importance to develop a health information system that provides timely risk evaluation and prediction of the COVID-19 infection in communities. We propose a spatiotemporal epidemiological forecast model that combines a spatial cellular automata (CA) with a temporal extended susceptible-antibody-infectious-removed (eSAIR) model under time-varying state-specific control measures. This new toolbox enables the projection of the county-level COVID-19 prevalence over 3109 counties in the continental United States, including $t$-day-ahead risk forecast and the risk related to a travel route. In comparison to the existing temporal risk prediction models, the proposed CA-eSAIR model informs the projected county-level risk to governments and residents of the local coronavirus spread patterns and the associated personal risks at specific geolocations. Such high-resolution risk projection is useful for decision-making on business reopening and resource allocation for COVID-19 tests.

Recommended citation: Zhou, Y., Wang, L., Zhang, L., Shi, L., Yang, K., He, J., … Song, P. (2020). A Spatiotemporal Epidemiological Prediction Model to Inform County-Level COVID-19 Risk in the United States. Harvard Data Science Review. https://doi.org/10.1162/99608f92.79e1f45e https://hdsr.mitpress.mit.edu/pub/qqg19a0r/release/2

On minimum Bregman divergence inference

Published in arXiv, 2020

Abstract: In this paper a new family of minimum divergence estimators based on the Bregman divergence is proposed. The popular density power divergence (DPD) class of estimators is a sub-class of Bregman divergences. We propose and study a new sub-class of Bregman divergences called the exponentially weighted divergence (EWD). Like the minimum DPD estimator, the minimum EWD estimator is recognised as an M-estimator. This characterisation is useful while discussing the asymptotic behaviour as well as the robustness properties of this class of estimators. Performances of the two classes are compared – both through simulations as well as through real life examples. We develop an estimation process not only for independent and homogeneous data, but also for non-homogeneous data. General tests of parametric hypotheses based on the Bregman divergences are also considered. We establish the asymptotic null distribution of our proposed test statistic and explore its behaviour when applied to real data. The inference procedures generated by the new EWD divergence appear to be competitive with or better than the DPD-based procedures.

Recommended citation: Purkayastha, Soumik and Basu, Ayanendranath: “On minimum Bregman divergence inference”, 2020; arXiv:2008.06987. https://arxiv.org/abs/2008.06987

Prediction of monthly Hilsa (Tenualosa ilisha) catch in the Northern Bay of Bengal using Bayesian structural time series model

Published in Regional Studies in Marine Science, 2020

Abstract: Monthly catch of Hilsa in the Northern Bay of Bengal was modeled and predicted using the Bayesian structural time series (BSTS) with chlorophyll and rainfall as explanatory covariates. Chlorophyll and rainfall were used in the model with a lag time of eleven months and one month, respectively. These lag time values were determined from the exploratory data analysis. In this study, four separate models were developed to predict the Hilsa catch and compare the results. In the first model, the catch was predicted based on trend and seasonality of the previous years’ catch data. The second and third models predicted monthly catch with chlorophyll (having a lag time of 11 months) and rainfall (having a lag time of 1 month), respectively, as the only covariate in the model. The fourth model had both chlorophyll and rainfall as covariates with a lag time of 11 months and 1 month, respectively. The training mean absolute percentage error (MAPE) of the fourth model (measuring model fit) was observed as 0.2827, while the test MAPE (measuring model accuracy) was 0.2786. The coefficient values for both chlorophyll and rainfall were very close (0.999 and 1.064) which indicated the almost equal effect of both the parameters on the Hilsa catch. This study indicated that more abundance of the food material (phytoplankton) during the post-monsoon season would promote the growth and maturity of more Hilsa that would successively migrate towards river or estuary during the subsequent monsoon season and the catch would likely increase with the intensification of the southwest monsoon.

Recommended citation: Sandip Giri, Soumik Purkayastha, Sugata Hazra, Abhra Chanda, Isha Das, Sourav Das. Prediction of monthly Hilsa (Tenualosa ilisha) catch in the Northern Bay of Bengal using Bayesian structural time series model. Regional Studies in Marine Science, Volume 39, 2020, 101456, ISSN 2352-4855, https://doi.org/10.1016/j.rsma.2020.101456. https://www.sciencedirect.com/science/article/pii/S2352485520305843

Comprehensive public health evaluation of lockdown as a non-pharmaceutical intervention on COVID-19 spread in India: national trends masking state-level variations

Published in BMJ Open, 2020

Abstract: Objectives: To evaluate the effect of four-phase national lockdown from March 25 to May 31 in response to the COVID-19 pandemic in India and unmask the state-wise variations in terms of multiple public health metrics. Design: Cohort study (daily time series of case counts). Setting: Observational and population based. Participants: Confirmed COVID-19 cases nationally and across 20 states that accounted for >99% of the current cumulative case counts in India until 31 May 2020. Exposure: Lockdown (non-medical intervention). Main outcomes and measures: We illustrate the masking of state-level trends and highlight the variations across states by presenting evaluative evidence on some aspects of the COVID-19 outbreak: case fatality rates, doubling times of cases, effective reproduction numbers and the scale of testing. Results: The estimated effective reproduction number R for India was 3.36 (95% CI 3.03 to 3.71) on 24 March, whereas the average of estimates from 25 May to 31 May stands at 1.27 (95% CI 1.26 to 1.28). Similarly, the estimated doubling time across India was at 3.56 days on 24 March, and the past 7-day average for the same on 31 May is 14.37 days. The average daily number of tests increased from 1717 (19–25 March) to 113,372 (25–31 May) while the test positivity rate increased from 2.1% to 4.2%, respectively. However, various states exhibit substantial departures from these national patterns. Conclusions: Patterns of change over lockdown periods indicate the lockdown has been partly effective in slowing the spread of the virus nationally. However, there exist large state-level variations and identifying these variations can help in both understanding the dynamics of the pandemic and formulating effective public health interventions. Our framework offers a holistic assessment of the pandemic across Indian states and union territories along with a set of interactive visualisation tools that are daily updated at covind19.org.

Recommended citation: Salvatore M, Basu D, Ray D, et al. Comprehensive public health evaluation of lockdown as a non-pharmaceutical intervention on COVID-19 spread in India: national trends masking state-level variations. BMJ Open 2020;10:e041778. doi: 10.1136/bmjopen-2020-041778 https://bmjopen.bmj.com/content/10/12/e041778.abstract

A comparison of five epidemiological models for transmission of SARS-CoV-2 in India

Published in BMC Infectious Diseases, 2021

Abstract: Background: Many popular disease transmission models have helped nations respond to the COVID-19 pandemic by informing decisions about pandemic planning, resource allocation, implementation of social distancing measures, lockdowns, and other non-pharmaceutical interventions. We study how five epidemiological models forecast and assess the course of the pandemic in India: a baseline curve-fitting model, an extended SIR (eSIR) model, two extended SEIR (SAPHIRE and SEIR-fansy) models, and a semi-mechanistic Bayesian hierarchical model (ICM). Methods: Using COVID-19 case-recovery-death count data reported in India from March 15 to October 15 to train the models, we generate predictions from each of the five models from October 16 to December 31. To compare prediction accuracy with respect to reported cumulative and active case counts and reported cumulative death counts, we compute the symmetric mean absolute prediction error (SMAPE) for each of the five models. For reported cumulative cases and deaths, we compute Pearson’s and Lin’s correlation coefficients to investigate how well the projected and observed reported counts agree. We also present underreporting factors when available, and comment on uncertainty of projections from each model. Results: For active case counts, SMAPE values are 35.14% (SEIR-fansy) and 37.96% (eSIR). For cumulative case counts, SMAPE values are 6.89% (baseline), 6.59% (eSIR), 2.25% (SAPHIRE) and 2.29% (SEIR-fansy). For cumulative death counts, the SMAPE values are 4.74% (SEIR-fansy), 8.94% (eSIR) and 0.77% (ICM). Three models (SAPHIRE, SEIR-fansy and ICM) return total (sum of reported and unreported) cumulative case counts as well. We compute underreporting factors as of October 31 and note that for cumulative cases, the SEIR-fansy model yields an underreporting factor of 7.25 and ICM model yields 4.54 for the same quantity. 
For total (sum of reported and unreported) cumulative deaths the SEIR-fansy model reports an underreporting factor of 2.97. On October 31, we observe 8.18 million cumulative reported cases, while the projections (in millions) from the baseline model are 8.71 (95% credible interval: 8.63–8.80), while eSIR yields 8.35 (7.19–9.60), SAPHIRE returns 8.17 (7.90–8.52) and SEIR-fansy projects 8.51 (8.18–8.85) million cases. Cumulative case projections from the eSIR model have the highest uncertainty in terms of width of 95% credible intervals, followed by those from SAPHIRE, the baseline model and finally SEIR-fansy. Conclusions: In this comparative paper, we describe five different models used to study the transmission dynamics of the SARS-CoV-2 virus in India. While simulation studies are the only gold standard way to compare the accuracy of the models, here we were uniquely poised to compare the projected case-counts against observed data on a test period. The largest variability across models is observed in predicting the “total” number of infections including reported and unreported cases (on which we have no validation data). The degree of under-reporting has been a major concern in India and is characterized in this report. Overall, the SEIR-fansy model appeared to be a good choice with publicly available R-package and desired flexibility plus accuracy.
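The SMAPE comparison described in this abstract can be illustrated with a short Python sketch. It assumes one common definition of SMAPE (the mean of |projected − observed| divided by the average of their absolute values, times 100); the paper's exact formula may differ. The function name and the toy counts are hypothetical and are not data from the study.

```python
from typing import Sequence


def smape(observed: Sequence[float], projected: Sequence[float]) -> float:
    """Symmetric mean absolute prediction error, in percent.

    Assumed definition: 100 * mean(|p - o| / ((|o| + |p|) / 2)).
    Days where both values are zero contribute zero error.
    """
    if len(observed) != len(projected) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for o, p in zip(observed, projected):
        denom = (abs(o) + abs(p)) / 2.0
        total += 0.0 if denom == 0 else abs(p - o) / denom
    return 100.0 * total / len(observed)


# Hypothetical daily cumulative counts over a three-day test period.
toy_observed = [1000.0, 1100.0, 1200.0]
toy_projected = [1050.0, 1100.0, 1150.0]
print(f"SMAPE: {smape(toy_observed, toy_projected):.2f}%")
```

A lower value indicates closer agreement between projected and observed counts over the test period, which is how the abstract ranks the models.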

Recommended citation: Purkayastha, S., Bhattacharyya, R., Bhaduri, R. et al. A comparison of five epidemiological models for transmission of SARS-CoV-2 in India. BMC Infect Dis 21, 533 (2021). https://doi.org/10.1186/s12879-021-06077-9 https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-021-06077-9

Estimating the wave 1 and wave 2 infection fatality rates from SARS-CoV-2 in India

Published in BMC Research Notes, 2021

Abstract: Objective: There has been much discussion and debate around the underreporting of COVID-19 infections and deaths in India. In this short report we first estimate the underreporting factor for infections from publicly available data released by the Indian Council of Medical Research on reported number of cases and national seroprevalence surveys. We then use a compartmental epidemiologic model to estimate the undetected number of infections and deaths, yielding estimates of the corresponding underreporting factors. We compare the serosurvey based ad hoc estimate of the infection fatality rate (IFR) with the model-based estimate. Since the first and second waves in India are intrinsically different in nature, we carry out this exercise in two periods: the first wave (April 1, 2020–January 31, 2021) and part of the second wave (February 1, 2021–May 15, 2021). The latest national seroprevalence estimate is from January 2021, and thus only relevant to our wave 1 calculations. Results: Both wave 1 and wave 2 estimates qualitatively show that there is a large degree of “covert infections” in India, with model-based estimated underreporting factor for infections as 11.11 (95% credible interval (CrI) 10.71–11.47) and for deaths as 3.56 (95% CrI 3.48–3.64) for wave 1. For wave 2, the underreporting factor for infections escalates to 26.77 (95% CrI 24.26–28.81) and to 5.77 (95% CrI 5.34–6.15) for deaths. If we rely on only reported deaths, the IFR estimate is 0.13% for wave 1 and 0.03% for part of wave 2. Taking underreporting of deaths into account, the IFR estimate is 0.46% for wave 1 and 0.18% for wave 2 (till May 15). Combining waves 1 and 2, as of May 15, while India reported a total of nearly 25 million cases and 270 thousand deaths, the estimated numbers of infections and deaths stand at 491 million (36% of the population) and 1.21 million respectively, yielding an estimated (combined) infection fatality rate of 0.25%.
There is considerable variation in these estimates across Indian states. Up to date seroprevalence studies and mortality data are needed to validate these model-based estimates.
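As a quick check of the arithmetic behind the combined infection fatality rate quoted in this abstract, the following Python sketch divides the estimated deaths by the estimated infections. The function and variable names are hypothetical; the two inputs are the rounded figures stated above, so the result only reproduces the headline number to the precision given there.

```python
def fatality_rate_percent(deaths: float, infections: float) -> float:
    """Deaths per 100 infections."""
    if infections <= 0:
        raise ValueError("infections must be positive")
    return 100.0 * deaths / infections


# Combined wave 1 and wave 2 estimates quoted in the abstract (as of May 15, 2021).
estimated_infections = 491e6  # 491 million
estimated_deaths = 1.21e6     # 1.21 million

combined_ifr = fatality_rate_percent(estimated_deaths, estimated_infections)
print(f"Combined estimated IFR: {combined_ifr:.2f}%")
```

Rounded to two decimals, this agrees with the 0.25% combined estimate reported in the abstract.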

Recommended citation: Purkayastha, S., Kundu, R., Bhaduri, R. et al. Estimating the wave 1 and wave 2 infection fatality rates from SARS-CoV-2 in India. BMC Res Notes 14, 262 (2021). https://doi.org/10.1186/s13104-021-05652-2 https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-021-05652-2

SARS-CoV-2 Infection Fatality Rates in India: Systematic Review, Meta-analysis and Model-based Estimation

Published in Studies in Microeconomics, 2021

Abstract: Introduction: Fervourous investigation and dialogue surrounding the true number of SARS-CoV-2-related deaths and implied infection fatality rates in India have been ongoing throughout the pandemic, and especially pronounced during the nation’s devastating second wave. We aim to synthesize the existing literature on the true SARS-CoV-2 excess deaths and infection fatality rates (IFR) in India through a systematic search followed by viable meta-analysis. We then provide updated epidemiological model-based estimates of the wave 1, wave 2 and combined IFRs using an extension of the Susceptible-Exposed-Infected-Removed (SEIR) model, using data from 1 April 2020 to 30 June 2021. Methods: Following PRISMA guidelines, the databases PubMed, Embase, Global Index Medicus, as well as BioRxiv, MedRxiv and SSRN for preprints (accessed through iSearch), were searched on 3 July 2021 (with results verified through 15 August 2021). Altogether, using a two-step approach, 4,765 initial citations were screened, resulting in 37 citations included in the narrative review and 19 studies with 41 datapoints included in the quantitative synthesis. Using a random effects model with DerSimonian-Laird estimation, we meta-analysed IFR1, which is defined as the ratio of the total number of observed reported deaths divided by the total number of estimated infections, and IFR2 (which accounts for death underreporting in the numerator of IFR1). For the latter, we provided lower and upper bounds based on the available range of estimates of death undercounting, often arising from an excess death calculation. The primary focus is to estimate pooled nationwide estimates of IFRs with the secondary goal of estimating pooled regional and state-specific estimates for SARS-CoV-2-related IFRs in India. We also tried to stratify our empirical results across the first and second waves. 
In tandem, we presented updated SEIR model estimates of IFRs for waves 1, 2, and combined across the waves with observed case and death count data from 1 April 2020 to 30 June 2021. Results: For India, countrywide, the underreporting factors (URF) for cases (sourced from serosurveys) range from 14.3 to 29.1 in the four nationwide serosurveys; URFs for deaths (sourced from excess deaths reports) range from 4.4 to 11.9 with cumulative excess deaths ranging from 1.79 to 4.9 million (as of June 2021). Nationwide pooled IFR1 and IFR2 estimates for India are 0.097% (95% confidence interval [CI]: 0.067–0.140) and 0.365% (95% CI: 0.264–0.504) to 0.485% (95% CI: 0.344–0.685), respectively, again noting that IFR2 changes as excess deaths estimates vary. Among the included studies in this meta-analysis, IFR1 generally appears to decrease over time from the earliest study end date to the latest study end date (from 4 June 2020 to 6 July 2021, IFR1 changed from 0.199 to 0.055%), whereas a similar trend is not as readily evident for IFR2 due to the wide variation in excess death estimates (from 4 June 2020 to 6 July 2021, IFR2 ranged from (0.290–1.316) to (0.241–0.651)%). Nationwide SEIR model-based combined estimates for IFR1 and IFR2 are 0.101% (95% CI: 0.097–0.116) and 0.367% (95% CI: 0.358–0.383), respectively, which largely reconcile with the empirical findings and concur with the lower end of the excess death estimates. An advantage of such epidemiological models is the ability to produce daily estimates with updated data, with the disadvantage being that these estimates are subject to numerous assumptions, arduousness of validation and not directly using the available excess death data. Whether one uses empirical data or model-based estimation, it is evident that IFR2 is at least 3.6 times more than IFR1. 
Conclusion: When incorporating case and death underreporting, the meta-analysed cumulative infection fatality rate in India varied from 0.36 to 0.48%, with a case underreporting factor ranging from 25 to 30 and a death underreporting factor ranging from 4 to 12. This implies, by 30 June 2021, India may have seen nearly 900 million infections and 1.7–4.9 million deaths when the reported numbers stood at 30.4 million cases and 412 thousand deaths (Coronavirus in India) with an observed case fatality rate (CFR) of 1.35%. We reiterate the need for timely and disaggregated infection and fatality data to examine the burden of the virus by age and other demographics. Large degrees of nationwide and state-specific death undercounting reinforce the call to improve death reporting within India.

Recommended citation: Zimmermann L, Bhattacharya S, Purkayastha S, et al. SARS-CoV-2 Infection Fatality Rates in India: Systematic Review, Meta-analysis and Model-based Estimation. Studies in Microeconomics. November 2021. https://doi.org/10.1177/23210222211054324 https://journals.sagepub.com/doi/10.1177/23210222211054324

Extending the susceptible‐exposed‐infected‐removed (SEIR) model to handle the false negative rate and symptom‐based administration of COVID‐19 diagnostic tests: SEIR‐fansy

Published in Statistics in Medicine, 2022

Abstract: False negative rates of severe acute respiratory syndrome coronavirus 2 diagnostic tests, together with selection bias due to prioritized testing, can result in inaccurate modeling of COVID-19 transmission dynamics based on reported “case” counts. We propose an extension of the widely used Susceptible-Exposed-Infected-Removed (SEIR) model that accounts for misclassification error and selection bias, and derive an analytic expression for the basic reproduction number $R_0$ as a function of false negative rates of the diagnostic tests and selection probabilities for getting tested. Analyzing data from the first two waves of the pandemic in India, we show that correcting for misclassification and selection leads to more accurate prediction in a test sample. We provide estimates of undetected infections and deaths between April 1, 2020 and August 31, 2021. At the end of the first wave in India, the estimated under-reporting factor for cases was at 11.1 (95% CI: 10.7, 11.5) and for deaths at 3.58 (95% CI: 3.5, 3.66) as of February 1, 2021, while they change to 19.2 (95% CI: 17.9, 19.9) and 4.55 (95% CI: 4.32, 4.68) as of July 1, 2021. Equivalently, 9.0% (95% CI: 8.7%, 9.3%) and 5.2% (95% CI: 5.0%, 5.6%) of total estimated infections were reported on these two dates, while 27.9% (95% CI: 27.3%, 28.6%) and 22% (95% CI: 21.4%, 23.1%) of estimated total deaths were reported. Extensive simulation studies demonstrate the effect of misclassification and selection on estimation of $R_0$ and prediction of future infections. An R package, SEIRfansy, is developed for broader dissemination.
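The "equivalently" step in this abstract is a reciprocal: if the under-reporting factor (URF) is taken to be the ratio of estimated total counts to reported counts, then the share reported is 100/URF percent. The Python sketch below checks the four point estimates under that assumed definition; the function name and dictionary labels are hypothetical, and the confidence intervals are not recomputed here.

```python
def reported_share_percent(urf: float) -> float:
    """Percent of the estimated total that was reported, assuming URF = total / reported."""
    if urf <= 0:
        raise ValueError("under-reporting factor must be positive")
    return 100.0 / urf


# Point estimates of the under-reporting factor quoted in the abstract.
urf_point_estimates = {
    "cases, February 1, 2021": 11.1,
    "cases, July 1, 2021": 19.2,
    "deaths, February 1, 2021": 3.58,
    "deaths, July 1, 2021": 4.55,
}

for label, urf in urf_point_estimates.items():
    print(f"{label}: URF {urf} -> {reported_share_percent(urf):.1f}% reported")
```

To one decimal place, these reciprocals match the 9.0%, 5.2%, 27.9% and 22% figures given in the abstract.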

Recommended citation: Bhaduri, R., Kundu, R., Purkayastha, S., Kleinsasser, M., Beesley, L. J., Mukherjee, B. & Datta, J. Extending the susceptible-exposed-infected-removed (SEIR) model to handle the false negative rate and symptom-based administration of COVID-19 diagnostic tests: SEIR-fansy. Statistics in Medicine. 2022; 41(13): 2317–2337. doi:10.1002/sim.9357 https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9357

Lessons from SARS-CoV-2 in India: A data-driven framework for pandemic resilience

Published in Science Advances, 2022

Abstract: India experienced a massive surge in SARS-CoV-2 infections and deaths during April to June 2021 despite having controlled the epidemic relatively well during 2020. Using counterfactual predictions from epidemiological disease transmission models, we produce evidence in support of how strengthening public health interventions early would have helped control transmission in the country and significantly reduced mortality during the second wave, even without harsh lockdowns. We argue that enhanced surveillance at district, state, and national levels and constant assessment of risk associated with increased transmission are critical for future pandemic responsiveness. Building on our retrospective analysis, we provide a tiered data-driven framework for timely escalation of future interventions as a tool for policy-makers.

Recommended citation: Salvatore, M., Purkayastha, S., Ganapathi, L., Bhattacharyya, R., Kundu, R., Zimmermann, L., Ray, D., Hazra, A., Kleinsasser, M., Subbaraman, R. & Mukherjee, B. Lessons from SARS-CoV-2 in India: A data-driven framework for pandemic resilience. Science Advances. 2022; 8(24). https://doi.org/10.1126/sciadv.abp8621 https://www.science.org/doi/10.1126/sciadv.abp8621

Asymmetric predictability in causal discovery: an information theoretic approach

Published in arXiv, 2022

Abstract: Causal investigations in observational studies pose a great challenge in scientific research where randomized trials or intervention-based studies are not feasible. Leveraging Shannon’s seminal work on information theory, we develop a causal discovery framework of “predictive asymmetry” for bivariate (X,Y). Predictive asymmetry is a central concept in information geometric causal inference; it enables assessment of whether X is a stronger predictor of Y or vice-versa. We propose a new metric called the Asymmetric Mutual Information (AMI) and establish its key statistical properties. The AMI is not only able to detect complex non-linear association patterns in bivariate data, but also is able to detect and quantify predictive asymmetry. Our proposed methodology relies on scalable non-parametric density estimation using fast Fourier transformation. The resulting estimation method is manyfold faster than the classical bandwidth-based density estimation, while maintaining comparable mean integrated squared error rates. We investigate key asymptotic properties of the AMI methodology; a new data-splitting technique is developed to make statistical inference on predictive asymmetry using the AMI. We illustrate the performance of the AMI methodology through simulation studies as well as multiple real data examples.

Recommended citation: Purkayastha, Soumik, and Peter XK Song. "Asymmetric predictability in causal discovery: an information theoretic approach." arXiv preprint arXiv:2210.14455 (2022). https://arxiv.org/pdf/2210.14455

fastMI: a fast and consistent copula-based estimator of mutual information

Published in arXiv, 2022

Abstract: As a fundamental concept in information theory, mutual information (MI) has been commonly applied to quantify association between random variables. Most existing estimators of MI have unstable statistical performance since they involve parameter tuning. We develop a consistent and powerful estimator, called fastMI, that does not incur any parameter tuning. Using a copula formulation, fastMI estimates MI by leveraging Fast Fourier transformation-based estimation of the underlying density. Extensive simulation studies reveal that fastMI outperforms state-of-the-art estimators with improved estimation accuracy and reduced run time for large data sets. fastMI not only provides a powerful test for independence that controls type I error, it may be used for further inference purposes. We establish asymptotic normality of fastMI for dependent random variables using a new data-splitting analytic argument. Anticipating that fastMI will be a powerful tool in estimating mutual information in a broad range of data, we develop an R package fastMI for broader dissemination.

Recommended citation: Purkayastha, Soumik, and Peter XK Song. "fastMI: a fast and consistent copula-based estimator of mutual information." arXiv preprint arXiv:2212.10268 (2022). https://arxiv.org/pdf/2212.10268

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.