In very simple terms, the analytics process starts with data collection, goes into cleaning, then exploration, model building, and model deployment. Historically, collection and cleaning have usually been done by individuals otherwise known as data engineers. So these individuals go out, they collect the data, they scrub it, they get it all ready for analysis. And they’re really our data experts, so they know that data. They’ve really been working with it. After they’re done, they send that off to the data analysts and statisticians.
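The stages just described can be made concrete with a small sketch. This is a minimal, entirely made-up Python example (the presentation itself later uses SAS, and all of the data and numbers here are invented for illustration): collect, clean, explore, fit a simple least-squares line, and then use it.

```python
# Illustrative sketch of the pipeline: collection -> cleaning ->
# exploration -> model building -> deployment. All data is made up;
# in practice each of these stages is its own discipline.

from statistics import mean

# 1. Collection (hard-coded records standing in for a real source)
raw = [
    {"day": 1, "cases": 10},
    {"day": 2, "cases": None},   # a missing value for cleaning to catch
    {"day": 3, "cases": 30},
    {"day": 4, "cases": 42},
]

# 2. Cleaning: drop records with missing values
clean = [r for r in raw if r["cases"] is not None]

# 3. Exploration: a simple summary statistic
avg_cases = mean(r["cases"] for r in clean)

# 4. Model building: least-squares fit of cases ~ day
xs = [r["day"] for r in clean]
ys = [r["cases"] for r in clean]
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# 5. "Deployment": use the fitted model to predict the next day
def predict(day):
    return intercept + slope * day

print(round(predict(5), 1))  # → 52.0
```

The point is not the arithmetic; it is that every stage depends on the one before it, which is exactly why the hand-offs between data engineers and analysts matter.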
A data analyst or a statistician will then take that data, explore the relationships, start building the models, deploy their models, check the models, do all of that fancy mathematical work that goes into what an analysis actually is. When data scientists came onto the stage to do their thing, what they basically are is an encompassing of data engineers, analysts, and topic specialists. So they’re supposed to know the field that the analysis is being run in, and also the mathematical and computer science structure of data and analyses.
They’re supposed to be able to do it all and bring the actual field that the analysis is being done for into the data and the analysis process. It’s like a translation layer within the interdisciplinary model that analysis actually fits into. The reason I chose COVID as the topic, or the subject, for this analytic presentation is because of the timeliness of it– we’re still in it– and just because of the complexity and insanity that was COVID. To begin, COVID happened in a very limited time frame; it started and picked up speed very quickly.
The data that analysts had to work with was all real-time. So data they could have pulled right this second is going to be old data tomorrow, or the next day. It’s something they constantly had to trip over themselves to try to approach and handle. Another issue with COVID is the inconsistent reporting sources and standards that were going into the collection process. Any two sources weren’t doing the reporting exactly the same. There were no data standards. The data wasn’t being structured the same. We weren’t collecting the same information, or in the same ways, or with the same understanding of what that information was supposed to tell us. We also had little to no prior knowledge of what COVID was, how it was impacting us, and what it really could do, so we didn’t really know what to test for.
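To make the reporting-standards problem concrete, here is a small, entirely hypothetical Python sketch: two invented sources report what should be the same counts under different field names and date formats, and a harmonization step has to map both into one common schema before any analysis can happen.

```python
from datetime import datetime

# Two hypothetical sources reporting "the same" thing differently:
# different field names, different date formats.
source_a = [{"report_date": "2020-03-15", "positive_tests": 120}]
source_b = [{"date": "03/16/2020", "confirmed": 95}]

def harmonize_a(rec):
    # Source A already uses ISO dates; just rename fields
    return {"date": rec["report_date"], "cases": rec["positive_tests"]}

def harmonize_b(rec):
    # Source B uses MM/DD/YYYY; convert to ISO for consistency
    iso = datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat()
    return {"date": iso, "cases": rec["confirmed"]}

combined = [harmonize_a(r) for r in source_a] + \
           [harmonize_b(r) for r in source_b]
print(combined)

# Note: field mapping fixes the format, but not the deeper problem --
# "positive_tests" and "confirmed" may not even measure the same thing.
```

The code fixes the surface inconsistency, but as the comment notes, no amount of renaming can reconcile sources that defined and counted things differently in the first place. That was the real COVID problem.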
It was kind of all over the place. And then lastly, again, you see these inconsistent reporting sources and standards, but now at the other end of the process, when we’re actually trying to push results out into the world. So that was also fragile and inconsistent, because we really didn’t know what exactly we needed to be telling the world. We just knew we needed to say something. So when looking for COVID data to use in this presentation, I came across a lot of sources. We have the CDC’s COVID Tracker, HealthData.gov, WHO, Kaggle– lots of different sources, lots of different data, lots of different formats, lots of different reasons as to why the data was collected, lots of people analyzing it. That’s really what I wanted to say about this type of data.
The main thing I want you to pull from this presentation is not that it’s about COVID, but that it’s about the analytic process, and how essential that process is in cases such as the pandemic. We really need to approach it carefully, with precise understanding and careful implementation, whereas we really weren’t able to do that with the demand for knowledge right now. The takeaway is that the COVID data we have is fragile. It’s inconsistent. It’s all over the place. But the analytic process, the things that needed to go into it– that never changed. And please pay attention to that part.
What we’re going to cover in this presentation today is how to choose and import data, some important standards, keys, and best practices, data exploration, data-driven modeling, matching your question to a model, evaluating your model, and then the conclusion. All right, so we’re going to begin with choosing and importing data. For each of these different topics, I do want to cover some key information, how to run that in SAS, and the best practices– so just throwing that out there. All right, so choosing data– when choosing data, there are several things that you need to take into consideration.
Are you looking at doing a primary analysis or a secondary analysis? Primary analyses involve you actually going out and collecting data in some way, maybe through a survey. You are the person that’s actually collecting and putting this data together. A secondary analysis is usually when you go out and find data that someone else has collected, or maybe that you collected before, but you want to do a different analysis on it. So it’s data that’s been collected without the key purpose of the analysis you want to do on it. Both of those different types of data are going to have different formats and caveats within them that you need to consider. There’s also experimental versus observational research.
Experimental is when you can actually randomize individuals into groups, and maybe give them a different drug or a different way of learning, so that you’re able to actually test the difference between those two groups. In observational research, you don’t have the luxury of being able to randomize individuals and place them into the groups you want to see them in. You actually have to observe individuals, or the subjects of your research, in their natural environment– or in a structured environment, but behaving in their natural way.
It’s a type of research where you can’t manipulate variables, but that in and of itself has strengths, because you’re actually able to see subjects without the pressure of that experimental aspect. Another thing you want to consider is, does the data ask or answer the right questions? The questions that you have might not be the questions that the data is ready to answer– so make sure that the questions you have match the type of data that you’re looking to collect. Another thing to consider is data structure.
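One practical way to check whether a data set can answer your question is simply to import it and inspect its structure before committing to an analysis. The presentation demonstrates this in SAS; as a language-neutral sketch, here is the same idea in Python using a made-up CSV (the column names and values are invented for illustration).

```python
import csv
import io

# A made-up CSV standing in for a downloaded secondary data set.
raw_csv = """state,date,new_cases
TX,2020-04-01,500
TX,2020-04-02,612
CA,2020-04-01,803
"""

reader = csv.DictReader(io.StringIO(raw_csv))
rows = list(reader)

# Inspect the structure before committing to an analysis:
columns = reader.fieldnames
print(columns)  # which variables do we actually have?

# Example check: if our question is about hospitalizations, this data
# (which only has case counts) cannot answer it.
can_answer_hospitalization_question = "hospitalizations" in columns
print(can_answer_hospitalization_question)  # False: wrong data for that question
```

This is the "does the data answer the right questions" check in its simplest form: look at what was actually collected before deciding what you can ask of it.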