**3. Quality of data (QOD)**

My experience dealing with data of different types and categories spans over four decades. From attending a survey technician training program after high school to studying in an engineering school, data management has played, and continues to play, a very significant role in my professional life. Moreover, the challenges encountered over this period continue to evolve rapidly. The most recent paradigm shift in data management is the proliferation of analytics, a domain that has enabled businesses, industry, academia, banks, and other institutions to breathe easier and address competing forces with might and vitality.

One adage that aptly describes the handling of different forms of data is "garbage in, garbage out" (GIGO). Interestingly, this adage is not limited to conventional data as described in the previous paragraph; it also has a human dimension. For example, healthy eating habits correlate positively with improved quality of life and health.

The importance of good data cannot be overemphasized in general, and it is especially critical in data-intensive methodologies like analytics.

Here is a case study from my personal and professional life. In 1992, Columbia University (CU) recruited me as a Senior Data Management Advisor. My very first assignment was to recalculate the incidence rate of HIV/AIDS. Four years earlier, CU had launched a project primarily managing an open HIV/AIDS cohort, that is, a population of interest that recruits new members as the study progresses.

The project's focus was to manage a cohort of over 13,000 participants and produce periodic reports (in this case, every six months) on the dynamics of the epidemic. The key milestones were morbidity rates: incidence and prevalence.
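The two morbidity measures differ in their denominators: incidence counts new infections among participants who were previously uninfected, relative to their follow-up time, while prevalence counts all existing infections in the population at a point in time. A minimal sketch with hypothetical figures (not the project's actual data):

```python
# Hypothetical cohort figures for illustration only; these are not
# the project's actual numbers.

def incidence_rate(new_cases, person_years):
    """New infections per 100 person-years of follow-up among
    participants who were uninfected at the start of the period."""
    return 100.0 * new_cases / person_years

def prevalence(existing_cases, population):
    """Proportion of the population infected at a point in time."""
    return existing_cases / population

# Suppose 90 new infections over 6,000 person-years of follow-up,
# and 1,560 infected among 13,000 participants at a survey round.
print(incidence_rate(90, 6_000))   # 1.5 per 100 person-years
print(prevalence(1_560, 13_000))   # 0.12
```

The point of the sketch is that a single miscounted case changes the numerator directly, which is exactly why the one-case discrepancy described below mattered so much.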

The week my assignment began coincided with a scientific conference in Holland, where Dr. Maria Wawer (my boss) and other colleagues were presenting papers on the project's findings. During that first week of the conference, Dr. Wawer contacted me to ask what incidence rates I had come up with. In the meantime, because of my limited knowledge of the data set, I recruited as consultants two experts who had been with the project since its inception. I identified what I believed were the most critical issues to address before starting the computations and subsequent analysis.

The team was then assigned specific tasks. These included cleaning the relevant data set: generating frequency tables; identifying outliers; and triangulating against source data (original questionnaires), laboratory technicians (serology test results), and survey team members. After completing this cleaning and validation process (including correcting the numerous inconsistencies), we performed the calculations using the Statistical Package for the Social Sciences (SPSS). This phase of the assignment went very well. After compiling the results, I submitted the findings (as earlier agreed) to Dr. Wawer, who was still at the conference in Holland. The recalculated rates were one infected case lower than what was being presented at the conference. And that, as it turned out, was a big deal! I received immediate feedback, as anticipated, highlighting the fact that I was new to the project team and had a limited understanding of the data sets.
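The first two cleaning steps, frequency tables to surface invalid category codes and range checks to flag outliers for follow-up against the original questionnaires, can be sketched as follows. The record fields and thresholds here are hypothetical, not the project's actual variables:

```python
from collections import Counter

# Hypothetical participant records; field names are illustrative.
records = [
    {"id": 1, "sex": "F", "age": 34, "serostatus": "positive"},
    {"id": 2, "sex": "M", "age": 29, "serostatus": "negative"},
    {"id": 3, "sex": "X", "age": 31, "serostatus": "negative"},   # invalid code
    {"id": 4, "sex": "F", "age": 210, "serostatus": "positive"},  # outlier
]

# Step 1: frequency tables reveal invalid category codes at a glance.
sex_freq = Counter(r["sex"] for r in records)

# Step 2: range checks flag outliers for triangulation against the
# source questionnaires and the serology log.
flagged = [r["id"] for r in records if not (0 <= r["age"] <= 110)]

print(sex_freq)   # the stray 'X' code stands out in the counts
print(flagged)    # [4]
```

Records flagged at either step would then go back to the source documents and field teams for verification, the triangulation described above, rather than being silently corrected or dropped.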

During one of our weekly team meetings (post-conference), held primarily to review what had gone wrong with our incidence rate, one of my colleagues was so embarrassed and distraught that he began shedding tears. Since no amount of consolation could calm him, the meeting was immediately adjourned. In the meantime, members of a similar, "competing" project were constantly asking us what the real incidence rate was: what should they quote in their papers? As the message continued to spread, our team agreed on a consensus response: the team was still in the review and validation process, after which the final incidence rates would be disclosed. This resolution served very well in mitigating further concerns.

During this process, our team went back to the drawing board to confirm what the real rates were. As part of the earlier triangulation process, we had already conducted a recount of the new infections, and the numbers were consistent with our findings. We repeated the recount, along with further calculations, and this time every level of verification confirmed our results: the figure presented at the conference had included one infected case too many!

And what is the message? PAAs and other quantitative methods are only as valid, reliable, and useful as the quality of data used.
