Twelve (Mis)Steps from Sober Assessments: Confessions of a Failed OIE Assessor

By Major Kara Masick

On August 24, 2022, an “Unheard Voice” was heard around the world. It was the first open-source research ‘outing’ of its kind; Graphika and the Stanford Internet Observatory (SIO) produced a report on a covert pro-Western social media influence operation from datasets that Twitter and Meta removed for violating platform policies.[1] The report claimed to be “Evaluating…information operations”, and a subsequent Washington Post editorial stated, “the Pentagon will conduct a sweeping review of its policies regarding clandestine information warfare”.[2] In addition to potential ethical concerns, this means one thing[3] to an OIE[4] operator: assessments.

Foolproof attribution of these operations to an actor will never occur, but frankly, attribution – accurate or not – has occurred. Due to the nature of most OIE, their assessments can be paradoxical in much the same way as OIE attribution: attributing effects to OIE both cannot happen and yet simultaneously does happen. Complete attribution of DoD mission-related effects to most of our OIE will never occur, yet we can work to decrease the uncertainty inherent in these assessments to support effective decision-making.

Other than a description of the data handed to them (in OIE terms, MOP[5]), the main claim made by SIO’s report is that “the data also shows the limitations of using inauthentic tactics to generate engagement and build influence online”. My initial response was, ‘Well, it seems like everyone has influence objectives.’[6] More power to them. Their objectives are quite noble and less ‘messy’ than the ones delegated to USCENTCOM.[7] Yet even the summary assessment from the two leading social media analysis businesses is, essentially, a weaker version of the causal claim that inauthentic tactics online limit engagement and influence. They also assume MOE[8] to be lacking based on a crude correlation between persona authenticity and amount of engagement[9] in the form of ‘follows’ [what we often call MOEi[10]]. Their ‘evaluation’ was based on a limited, nonrandom[11] sample of the greater campaign(s), and they do not have the full, classified story.

Another unclassified assessment, published in 2011, evaluated a DoS[12] initiative that was also trying to “come across as authentic individuals and ‘real’ persons” in online Arabic dialogue. It states that members of the target audience “…leave no evidence of their reactions online. It is therefore possible that lurkers are influenced by seeing anti-US views challenged online but it is impossible to be sure whether they are convinced by the DOT[13] arguments or not.”[14] The assessment ‘justifies’ this lack of MOE and MOEi by referencing – probably accurately – “…Noelle Neumann’s (1984) “spiral of silence” theory, which says that people are less likely to voice their opinions in public if they believe those opinions are in the minority…”.[15]

Were the authors of Unheard Voice even attempting an exhaustive evaluation of the effects of the pro-Western information operations? Probably not. It would have been nice if they hadn’t implied a lack of influence[16] for the Washington Post to ‘take and run with’. No one blames any of these assessors for failing to identify the ‘complete’ impact of these operations. Everybody fails at this; I’m a failed OIE assessor myself. I’m on a journey towards OIE assessment recovery, and you’re welcome to join me. Like an alcoholic seeking sobriety, I’ll start by admitting I have a problem and acknowledging where I’ve gone wrong: the rest of this essay is 12 confessions of my OIE assessment failures.

  1. I’ve had unscientific expectations for OIE assessments.

As much as I’d love to be able to prove the effectiveness of OIE, let’s start our recovery journey – together – with science-based expectations for OIE assessments. In OIE assessments, we’re often trying to infer a causal relationship from MOP to MOE. Science tries to identify causes and their effects, and an effect is the difference between what did happen and what would have happened if the proposed cause had not occurred (the counterfactual).[17] OIE effects, then, are the difference between what happened and its counterfactual (what would have resulted if the specific operation never occurred). Experiments are often the best method for trying to “create reasonable approximations to this physically impossible counterfactual.”[18] In the hard sciences, the counterfactual is more easily approximated, but we can never compare real-world OIE to a true counterfactual.[19]

Furthermore, John Stuart Mill posited that a causal relationship exists if: “1) the cause preceded the effect, 2) the cause was related to the effect, and 3) we can find no plausible alternative explanation for the effect other than the cause.”[20] A correlation between our OIE/MOP and its effect/MOE helps us out with #2, but not #3 (correlation doesn’t equal causation). Experiments try to assist with #1 and #3. On occasion, it can be hard to establish the temporal relationship between a desired behavior and the OIE, but what is most baffling for OIE assessors is the “no plausible alternative” aspect.
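To make the counterfactual, and Mill’s third condition, concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: exposure to an invented OIE message and a single ‘alternative explanation’ both nudge a simulated audience, and because the simulation generates both potential outcomes for every member, we can compare the effect science actually defines (observed minus counterfactual) with the naive exposed-versus-unexposed difference an assessor could really compute.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical audience members

# A plausible alternative explanation (current events, other influencers, etc.)
confounder = rng.normal(size=n)

# Exposure to the hypothetical OIE is more likely for already-engaged members
exposed = (confounder + rng.normal(size=n)) > 0

true_effect = 0.30  # ground truth only a simulation can know

# In a simulation, both potential outcomes exist; in the real world, only one does
outcome_without_oie = 0.5 * confounder + rng.normal(scale=0.5, size=n)
outcome_with_oie = outcome_without_oie + true_effect

observed = np.where(exposed, outcome_with_oie, outcome_without_oie)

# What an assessor can actually compute: exposed vs. unexposed difference
naive_estimate = observed[exposed].mean() - observed[~exposed].mean()

# The scientific definition of the effect: observed minus counterfactual
true_estimate = (outcome_with_oie - outcome_without_oie).mean()

print(f"naive exposed-vs-unexposed difference: {naive_estimate:.2f}")
print(f"effect vs. the true counterfactual:    {true_estimate:.2f}")
```

In this toy world the naive comparison overstates the effect, because the alternative explanation, not the operation, accounts for part of the gap – the ‘no plausible alternative’ condition going unmet.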

  2. I admit I am powerless over alcohol confounding variables.

This is why the soft sciences are “often harder than hard sciences” (along with the challenges of operationalizing and measuring our constructs).[21] Let’s say a public health organization is trying to determine the effectiveness of a messaging campaign to increase handwashing. They may have trouble discerning their impact from that of other influences on their audience, and to add insult to injury, COVID-19 happens mid-campaign.

We aren’t completely powerless, but the ‘fixes’ are neither perfect nor very practical when it comes to real-world OIE. Controlling for the variance created by confounding factors (both moderating and mediating variables) means either designing an experiment that holds them constant or intentionally measuring them so their variance can be controlled for statistically. This allows you to imperfectly correct for confounding factors (current events, other influencers, etc.) and better estimate how much of the variance in MOE is actually attributable to MOP.
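Here is a minimal sketch of the second ‘fix’ – measuring a confounder and statistically controlling for it – using the hypothetical handwashing campaign above. All the numbers, variable names, and the simple least-squares models are invented for illustration; a real assessment would use whatever MOP, MOE, and confounder data it actually has.

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = 52

# Hypothetical weekly data for the handwashing campaign
covid_salience = np.r_[np.zeros(30), np.linspace(0, 5, 22)]           # confounder arrives mid-campaign
message_volume = rng.poisson(20, size=weeks) + 3.0 * covid_salience   # MOP surges with the news cycle

# Simulated MOE: a handwashing index driven by both, plus noise (true MOP coefficient = 2.0)
handwashing = 2.0 * message_volume + 8.0 * covid_salience + rng.normal(scale=10, size=weeks)

# Model 1: MOE ~ MOP only (confounder ignored)
X_naive = np.column_stack([np.ones(weeks), message_volume])
beta_naive, *_ = np.linalg.lstsq(X_naive, handwashing, rcond=None)

# Model 2: MOE ~ MOP + measured confounder (statistically controlled)
X_adjusted = np.column_stack([np.ones(weeks), message_volume, covid_salience])
beta_adjusted, *_ = np.linalg.lstsq(X_adjusted, handwashing, rcond=None)

print(f"MOP coefficient, confounder ignored:    {beta_naive[1]:.2f}")
print(f"MOP coefficient, confounder controlled: {beta_adjusted[1]:.2f}  (true value 2.0)")
```

Ignoring the measured confounder credits the campaign with some of COVID-19’s effect; including it imperfectly corrects the estimate, which is exactly what ‘statistically controlling for their variance’ buys us.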

The methods with the greatest likelihood of establishing causal relationships are randomized experiments, especially the “pretest-posttest control group design”.[22] They control for confounding variables by either keeping them consistent or intentionally measuring them in a controlled environment. Even the greatest randomized lab experiment has wonderful internal validity[23], but there are always threats to external validity[24] – that is, to employment in the field.[25] Randomized field experiments are the recommended best practice for real-world applied social action, but I bet there are policy limitations on that for OIE. There are more threats to internal validity in the field and still many threats to external validity.
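For completeness, here is a minimal sketch of what the pretest-posttest control group design buys us, on simulated data with invented numbers. Random assignment means both groups experience the same outside drift, so subtracting the control group’s change approximates the counterfactual for the treated group.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # hypothetical audience members per group, randomly assigned

# Pretest attitude scores; randomization makes the groups comparable on average
pre_treatment = rng.normal(50, 10, n)
pre_control = rng.normal(50, 10, n)

# Everyone drifts with current events; only the treatment group also gets the hypothetical OIE
common_drift = 3.0
oie_effect = 4.0  # ground truth we hope to recover

post_treatment = pre_treatment + common_drift + oie_effect + rng.normal(0, 5, n)
post_control = pre_control + common_drift + rng.normal(0, 5, n)

# Pretest-posttest control group estimate: change in treatment minus change in control
estimate = (post_treatment - pre_treatment).mean() - (post_control - pre_control).mean()
print(f"estimated OIE effect: {estimate:.2f} (true value 4.0)")
```

Whether DoD OIE could ever run something this clean in the field is, as noted above, another question.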

  3. I’ve given up on assessments.

Don’t lose hope. We must continue to investigate the effectiveness of our operations. Furthermore, convincing attribution of OIE effects has still occurred (see confession #6 for how). Effective assessment of OIE can be done, even as I promote a very nuanced, accurate understanding of the limits of attributing OIE effects. We can substantially decrease the uncertainty inherent in this sort of work to support effective decision-making and to plan improved future operations.

  4. I’ve communicated unrealistic expectations for OIE assessments.

Science-based assessment expectations can correctly calibrate operator action and communication with DoD decision-makers, and they can even facilitate mutual understanding and accountability with other government agencies. An added benefit for operators: these expectations are more reasonable than ones akin to causal attribution in the material sciences.

  5. I didn’t know about program evaluation.

We can improve assessments by learning from program evaluation.[26] We can capitalize on the wisdom and experience of decades of applied social science research aimed at assessing the effectiveness of government programs. That field learned the lessons of my first two confessions – and more, like Rossi’s Iron Law[27] – the hard way, and it has continued to fight back. Its practitioners theorized additional goals for evaluation and developed methods to enable the use of assessment results.[28] For example, they have developed very practical, useful methods to apply when appropriate, such as rapid feedback evaluations and evaluability assessments.[29]

  6. I’ve thought about causal attribution when I should be playing with my kids.

I’m not proud of it, but in an effort to redeem a bit of that lost time, I’ll share some of those thoughts below.

If we collect MOE, to what extent can we attribute it to OIE?

Conceptually, we can attribute MOE (effects) to an operation (or campaign) in the information environment to the extent that we have inferred:

  • …the MOE were not likely if the operation had not occurred (the counterfactual)
  • …the effects produced (MOE) are over and above what would have occurred without the operation.[30]

Practically, to the extent that we have:

  a. …controlled for potential alternative explanations to the MOE
  b. …replicated the MOE
  c. …persuaded relevant decision-makers
  d. …inferred the operation to be the most likely of potential causes

These will be assessments made with “varying degrees of confidence”[31]; probability theory and statistics will likely be useful.

  7. I’m about to propose yet another DoD acronym.

Relating to “d)” above, I will propose MLC: Most Likely Cause.[32] We could infer that our OIE is the most likely cause of the intelligence we collect (perhaps multiple MOEi). Assessors should learn to think like an IT helpdesk professional troubleshooting a computer issue based on its most likely cause. Like detectives following clues to the most likely culprit, we can weigh the relative likelihood that our OIE produced the effect against other potential causes.
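A minimal sketch of that weighing, in Python: the candidate causes, their prior probabilities, and the likelihood of the observed MOEi under each are all invented for illustration (and treated as mutually exclusive for simplicity); in practice they would come from intelligence reporting and analyst judgment, not from a script.

```python
# Weighing candidate causes for an observed MOEi; all values below are hypothetical.
candidate_causes = {
    # cause: (prior probability, likelihood of the observed MOEi given that cause)
    "our OIE series": (0.25, 0.60),
    "adversary missteps covered in local media": (0.35, 0.30),
    "seasonal/economic factors": (0.30, 0.20),
    "allied information activities": (0.10, 0.40),
}

# Bayes' rule over discrete hypotheses: posterior is proportional to prior * likelihood
unnormalized = {cause: prior * lik for cause, (prior, lik) in candidate_causes.items()}
total = sum(unnormalized.values())
posterior = {cause: weight / total for cause, weight in unnormalized.items()}

for cause, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{prob:.2f}  {cause}")

mlc = max(posterior, key=posterior.get)
print(f"\nMost Likely Cause (MLC): {mlc}")
```

The output is not proof; it is a transparent, revisable ranking of relative likelihoods – which is what an MLC call should be.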

We can make a list of potential causes for a change in the IE[33] or audience behavior and infer whether the OIE fits into the “characteristic causal chain” of these effects or if it is the most likely of a partial causal list.[34] Applying the Modus Operandi Method of program evaluation, we can also include signatures or ‘tracers’ within our OIE (very specific verbiage, an image, etc.) that intel can follow through the IE to “assess deterioration, implementation, and so forth.”[35]
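As a toy illustration of the tracer idea, the sketch below tags hypothetical OIE products with distinctive phrases and checks a handful of invented collected items for them. Real tracers, collection, and matching (translation, paraphrase, imagery) would be far messier than a substring search.

```python
# Tag OIE content with distinctive markers, then check collected downstream media for them.
# The tracer phrases and collected items below are entirely hypothetical.
tracers = {
    "radio spot A": "the river does not argue with the mountain",
    "leaflet B": "three lanterns on the old bridge",
}

collected_items = [
    "Caller on the morning show said the river does not argue with the mountain...",
    "Forum post complaining about fuel prices, no distinctive phrasing.",
    "Local blog repeats 'three lanterns on the old bridge' almost word for word.",
]

# Which OIE products surfaced downstream, and where?
for product, phrase in tracers.items():
    hits = [i for i, text in enumerate(collected_items) if phrase in text.lower()]
    status = f"observed in items {hits}" if hits else "not observed"
    print(f"{product:12s} -> {status}")
```

Tracers of this sort speak to implementation and diffusion of our content, not to behavior change, so they feed MOEi rather than settle MOE.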

  8. I’ve stayed on The Strip when TDY to Nellis AFB.

It was cheaper, and OIE professionals need a strong, applied grasp of probabilities and likelihoods. Actually, let’s do this together, frequently.

Bayesian statistics specifically could improve OIE. Put simply, “Bayesian data analysis takes a question in the form of a model and uses logic to produce an answer in the form of probability distributions.”[36] We could measure relative probabilities/likelihoods of effective OIE (e.g., using formulas developed by Venhaus et al.[37] and Smith et al.[38]) to finely calibrate the decrease in uncertainty for decision-makers and to iteratively improve future operations through ‘learning’ models.
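As a minimal Bayesian sketch (with entirely hypothetical counts, and no claim that this is the Venhaus or Smith method), the code below treats an engagement rate before and during an operation window as two Beta-Binomial models and reports the probability, and size, of an increase.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical MOEi counts: engagements per impression before and during an operation window
before = {"engagements": 120, "impressions": 10_000}
during = {"engagements": 175, "impressions": 10_000}

# Beta(1, 1) prior on each engagement rate; Beta-Binomial conjugacy gives the posteriors
post_before = rng.beta(1 + before["engagements"],
                       1 + before["impressions"] - before["engagements"], size=100_000)
post_during = rng.beta(1 + during["engagements"],
                       1 + during["impressions"] - during["engagements"], size=100_000)

# Probability the engagement rate was higher during the operation, and by how much
p_increase = (post_during > post_before).mean()
lift = post_during - post_before

print(f"P(rate increased during the operation): {p_increase:.3f}")
print(f"median lift: {np.median(lift):.4f} "
      f"(90% interval {np.percentile(lift, 5):.4f} to {np.percentile(lift, 95):.4f})")
```

Even a confident ‘the rate rose’ says nothing by itself about why it rose – confessions one and two still apply – but expressing the finding as a probability distribution is exactly the calibrated decrease in uncertainty decision-makers need.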

  9. I’ve prioritized looking better over getting better.

I’ve tried to appear as if my operations are perfect or as if I know they are accomplishing their objectives. I believe this typically stems from a desire to prove OIE’s worth in an arsenal that includes the most immediate, kinetic DoD fires.[39] As the field of program evaluation matured, many of the best veteran assessors became convinced of the primacy of formative evaluation[40] over summative evaluation.[41],[42] Prioritizing improving OIE over proving it will also foster assessment practices that work to identify a more accurate picture of our OIE’s impact instead of a photoshopped one.

  10. I’ve tried to be both the operator and my own evaluator.

This is not a sin, yet I’ve only recently learned that effects-based assessment is not advised. I wonder how many OIE operators and planners are spun up on assessment best practices, including many mentioned by the Colonels in the article cited in this endnote.[43]

A dedicated assessments unit of experienced OIE operators focused on improving operations with cognitive domain components could do wonders. They’d have insider knowledge and a vested interest in DoD OIE success, without assessing their own plans and operations. They’d have greater incentive to collect time-series data and to facilitate assessment principles that are hard for operators, like ‘fail fast’.[44] Because evaluations must balance rigor against practical constraints, such a unit could also optimize resource efficiency. They could be trained in research methods, statistics, and program evaluation and serve as the keepers of institutional knowledge, responsible for creating and iteratively improving OIE assessment performance aids/tools/checklists/SOPs/best practices. They could aid in surveying/polling audiences and pretesting products. They’d be well situated for meta-analysis of all sorts of OIE across different missions, Joint units, and levels (tactical, operational, and strategic).

Meanwhile, for dual OIE operators and assessors, see my other endnotes for resources and RAND’s list in this endnote.[45] We can also follow this advice from a veteran program evaluator: “I think the time has come to change our orientation in the development of social science away from the goal of abstract, quantitative, predictive theories toward specific, qualitative, explanatory checklists and trouble-shooting charts.”[46] We in the DoD do this well; it’s part of our culture. We can use checklists, as program evaluators do, to improve the “validity, reliability, and credibility” of evaluations,[47],[48] capture best practices, and iteratively improve them.

  11. I’ve compared my progress to others.

The recent ‘outing’ report implied we operate like our adversaries … we don’t. If we did, we’d be better at OIE. We don’t compete with our adversaries in the IE by playing their game. We can improve how we do this in the American way. In a marketplace of information environment ops, strategies, and tactics, we can assess which ones are better than others. Don’t look to other countries as examples; don’t compare us to them. We can compare ourselves to our own past and make measured, often iterative improvements.

  12. I’ve likely failed.

But I’ll pick myself back up and continue to work to improve OIE, not prove it. I’ll work to decrease the uncertainty inherent in OIE to support our Commanders to make decisions that will improve future operations. I’ll beware of certainty where there is none, and I’ll get comfortable living within and communicating about the land of likelihoods and probabilities.[49] I’ll continue working on proactive, ethical, modern, effective OIE.

The OIE battlefield is littered with ‘dead’ personas, removed from platforms. It is extremely likely that some of them were shouting into the ‘wind’, vast echo chambers of silence. Messengers sent out and never heard from again. They live and we operate in that uncertainty. Recognize it, but don’t get too comfortable there. Work to decrease it and to illuminate the vast darkness.

We all fail sometimes. What do Americans do when we fail to be heard? We talk louder. Our speech is often obnoxious (on public transport overseas), too ‘forward’, likely presumptive, possibly offensive, and free. It’s certainly freer (cheaper) than other, more kinetic DoD fires. I’ll continue working to improve OIE, not to prove it. I’ll try to repent from my 12 confessions above, call a friend when I want to relapse, pray to God, etc. – whatever I need to do to make sober assessments.

Footnotes

[1] Graphika and Stanford Internet Observatory (2022). Unheard Voice: Evaluating five years of pro-Western covert influence operations. Stanford Digital Repository. Available at https://purl.stanford.edu/nj914nx9540

[2] The Editorial Board. (2022, September 20). Pentagon’s alleged secret social media operations demand a reckoning [Opinion]. The Washington Post. https://www.washingtonpost.com/opinions/2022/09/20/military-pentagon-fake-social-media/?utm_source=twitter&utm_medium=social&utm_campaign=wp_opinions

[3] I chose to omit strategic communications-related policy concerns as they are above the paygrade of this paper’s primary audience [the OIE operational community], not to mention a time and attention resource ‘luxury’ that those working active national defense-related missions can’t afford. Certainly, OIE operators will do our best to abide by future policy implications if any novel operational limitations result.

[4] Operations in the Information Environment; This paper’s focus is the application of informational (not physical) power “…to affect the observations, perceptions, decisions, and behaviors of relevant actors…” especially communication with inform, influence, and persuade objectives – CJCS’s JCOIE, 25 Jul 2018, p.24.

[5] Measures of Performance

[6] Later fleshed out a bit more as: Everyone within information environment-related professions, potentially everyone (including the authors of this report), has their own (personal or organizational) influence-related objectives.

[7] ‘No lying’ is a lot simpler than ‘No terrorism’ (as the simplest version of a hypothetical USCENTCOM objective I can think of at the moment).

[8] Measures of Effectiveness: “a criterion used to assess changes in system behavior, capability, or operational environment that is tied to measuring the attainment of an end state, achievement of an objective, or creation of an effect” and that “describes what the specific target [audience] needs to do to demonstrate accomplishment of a desired effect.” -Headquarters, U.S. Department of the Army, Inform and Influence Activities, Field Manual 3-13, Washington, D.C., January 2013a.

[9] Social media engagement primarily comparing the number of followers [also listing counts and averages of likes and retweets, selectively referencing either the “whole” data portion provided to them or only the “covert” accounts]

[10] Measures of Effectiveness indicators

[11] To the best of my knowledge; my apologies if I’m incorrect and Meta or Twitter and/or these companies pulled a random sample to make these datasets.

[12] Department of State

[13] Digital Outreach Team (of the State Department)

[14] Khatib, L., Dutton, W., & Thelwall, M. (2011). Public Diplomacy 2.0: An Exploratory Case Study of the US Digital Outreach Team. Middle East Journal, 66. https://doi.org/10.2307/23256656, p. 13-14.

[15] Khatib, L., Dutton, W., & Thelwall, M. (2011). Public Diplomacy 2.0: An Exploratory Case Study of the US Digital Outreach Team. Middle East Journal, 66. https://doi.org/10.2307/23256656, p. 12.

[16] Summative Evaluation of the OIE, defined later in the essay [Evaluating the effectiveness of the OIE, relating to outcomes/impact/effects to the operations]

[17] Cook, T. D., Campbell, D. T., & Shadish, W. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin. p. 1-5.

[18] Cook, T. D., Campbell, D. T., & Shadish, W. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin. p. 5.

[19] A test tube can be a counterfactual (prepared as an exact replica of the one receiving the treatment), but an exact replica of a target audience that does not encounter your OIE (or other potential causes/factors) to compare to the audience members who do, does not exist.

[20] Cook, T. D., Campbell, D. T., & Shadish, W. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin. p. 6.

[21] Diamond, J. (1987). Soft sciences are often harder than hard sciences. Discover. https://content.csbs.utah.edu/~cashdan/fieldmeth/diamond_soft_sciences.pdf

[22] Campbell, D., & Stanley, J. (1963). Experimental and Quasi-Experimental Designs for Research (1st ed.). p. 14.

[23] Internal validity: “Did in fact the experimental treatments make a difference in this specific experimental instance?”

[24] External validity: Generalizability: To what populations, settings, treatment variables, and measurement variables can this effect be generalized?

[25] Campbell, D., & Stanley, J. (1963). Experimental and Quasi-Experimental Designs for Research (1st ed.). p. 5.

[26] “Program evaluation is the use of social research methods to systematically investigate the effectiveness of social intervention programs in ways that are adapted to their political and organizational environments and are designed to inform social action to improve social conditions.” – Rossi in Evaluation: A Systematic Approach

[27] Rossi, P. H. (2004). Evaluation: A systematic approach (7th ed.). Sage Publications. p. 16

[28] Shadish, W. R. (1991). Foundations of program evaluation: Theories of practice. Sage Publications.

[29] Shadish, W. R. (1991). Foundations of program evaluation: Theories of practice. Sage Publications. p. 225.

[30] Rossi, P. H. (2004). Evaluation: A systematic approach (7th ed.). Sage Publications. p. 253.

[31] Rossi, P. H. (2004). Evaluation: A systematic approach (7th ed.). Sage Publications. p. 252.

[32] Most Likely Cause

[33] Information Environment

[34] Scriven, M. (1974). Maximizing the power of causal investigations: The modus operandi method. In Evaluation in education (pp. 68–84). McCutchan Publishing. p. 72-73.

[35] Scriven, M. (1974). Maximizing the power of causal investigations: The modus operandi method. In Evaluation in education (pp. 68–84). McCutchan Publishing. p. 76.

[36] McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd ed.). p. 11.

[37] Venhaus, M. et al. (2021). Structured Process for Information Campaign Evaluation (SP!CE): An Analytic Framework, Knowledge Base, and Scoring Rubric for Operations in the Information Environment. MP210039. Approved for Public Release; Distribution Unlimited. p. 6-1 to 6-6.

[38] Smith, S. T., Kao, E. K., Mackin, E. D., Shah, D. C., Simek, O., & Rubin, D. B. (2021). Automatic detection of influential actors in disinformation networks. Proceedings of the National Academy of Sciences, 118(4), e2011216118. https://doi.org/10.1073/pnas.2011216118; Smith, S. T., Kao, E. K., Shah, D. C., Simek, O., & Rubin, D. B. (2018). Influence Estimation on Social Media Networks Using Causal Inference. 2018 IEEE Statistical Signal Processing Workshop (SSP), 328–332. https://doi.org/10.1109/SSP.2018.8450823

[39] Fires: “the use of weapon systems to create specific lethal and nonlethal effects on a target.” -JP 3-0

[40] Formative Evaluation: Evaluation activities undertaken to furnish information that will guide program improvement. – Rossi, 2004

[41] Summative evaluation: Evaluative activities undertaken to render a summary judgment on certain critical aspects of the program’s performance, for instance, to determine if specific goals and objectives were met. – Rossi, 2004

[42] Shadish, W. R. (1991). Foundations of program evaluation: Theories of practice. Sage Publications. p. 331.

[43] Arnhart, L. & King, M. (2018). Are We There Yet? Implementing Best Practices in Assessments. Military Review. https://www.armyupress.army.mil/Portals/7/military-review/Archives/English/King-Arnhart-Are-We-There.pdf

[44] Paul, C., Yeats, J., Clarke, C. P., Matthews, M., & Skrabala, L. (2015). Assessing and Evaluating Department of Defense Efforts to Inform, Influence, and Persuade: Desk Reference, Handbook, and Checklist. RAND Corporation. https://www.rand.org/pubs/research_reports/RR809z2.html

[45] Paul, C., Yeats, J., Clarke, C. & Matthews, M. (2015). Assessing and Evaluating Department of Defense Efforts to Inform, Influence, and Persuade: An Annotated Reading List. Santa Monica, CA: RAND Corporation. https://www.rand.org/pubs/research_reports/RR809z3.html

[46] Scriven, M. (1974). Maximizing the power of causal investigations: The modus operandi method. In Evaluation in education (pp. 68–84). McCutchan Publishing

[47] Scriven, M. (2005). The Logic and Methodology of Checklists.

[48] https://wmich.edu/evaluation/checklists/checklistsvalidation

[49] Rossi, P. H. (2004). Evaluation: A systematic approach (7th ed.). Sage Publications. p. 16.

 

About The Author

Kara Masick is an Air Force Information Operations officer (14F) with a passion for MISO and PSYOP mostly within Intelligence and Cyber organizations. She was the first 14F officially assigned to the J39 of the Information Warfare Numbered Air Force (16AF).

She was sponsored by USSOCOM to study Psychology, which she is doing within the Measurement Research Methodology Evaluation and Statistics Lab at George Mason University (GMU) with the goal of improving her MISO operations and assessment abilities. Her dissertation research on persuasion uses Large Language Models to analyze text.
Opinions, conclusions, and recommendations expressed or implied above are solely those of the author and do not necessarily represent the views of The Air University, the United States Air Force, the Department of Defense, or any other US government agency.