AI in Data Analysis

Authors

Štěpánka Pustková, Branislav Kopčan, Klára Klinkovská, Irena Axmanová

Ethics

AI tools, and especially LLMs (large language models), have been widely available for only a few years. In that short time, they have managed to gain a die-hard fan base, as well as a group of people who see them nearly as a harbinger of the end of the world, or at least of the world as we know it (Kokotajlo et al., 2025). So, everything we are trying to point out here may soon become completely irrelevant…

Many people use AI tools regularly, including the authors of this leaflet. We use them most often for coding tasks (mainly script debugging), literature search, summarising long texts, explaining concepts, and text editing. AI tools can give better search results than traditional search engines, can help with learning new analytical methods and programming languages, and can also make interdisciplinary research easier (Mammides & Papadopoulos, 2024; RStudioDataLab, 2023). We want to encourage you to play and experiment with AI tools, and to try to understand how they work and where the limits of their usage lie. But keep in mind that the content of your conversations with AI tools (i.e. data, information about yourself) may be used for further improvement of the model, especially in free versions. So, even though some tools (e.g. ChatGPT) allow you to turn off chat history or delete your conversations, think twice before you share anything and do not send sensitive data to LLMs.

One of the important (and still not really resolved) issues of extensive LLM usage relates to copyright. If you let an LLM write a homework assignment, an essay, or even a whole article, who is actually the author? Is this a case of plagiarism? Would it be fair to acknowledge the authorship of the AI tool, too? How? Should ChatGPT, for example, be credited as a co-author? Who is responsible for the correctness and accuracy of such a text (The group for AI in teaching at Masaryk University, 2023; Wu et al., 2024)? It is also useful to realise that the quality of an AI-generated text originates from thousands of well-written texts by professional writers (Wu et al., 2024) whose work is usually not credited in the AI output. As a researcher, you will definitely encounter questions about the true authorship of ideas and text, and about the credibility and authenticity of your outcomes (Johnson et al., 2024; The group for AI in teaching at Masaryk University, 2023; Wu et al., 2024). It might (or might not, who knows) also happen that extensive, uncredited use of LLMs in writing research papers will be considered unacceptable in the future, and as such, it might even discredit your whole research work (Johnson et al., 2024). Many journals and institutions (including Masaryk University) have prepared guidelines regarding AI usage and its reporting. Usually, the use of AI must be reported when it is used to write longer parts of text (e.g., an abstract), or to analyse or visualise data. However, a broader consensus on reporting AI use is still missing, and it is therefore necessary to check the specific rules and conditions set by each institution, journal, or lecturer before using AI tools.

Other ethical issues come up directly from the generated content. LLMs can suffer from issues such as training data poisoning or improper data sanitisation, which may lead to significant biases in the outcomes (Johnson et al., 2024; Wu et al., 2024). LLMs are also known to hallucinate non-existent things, e.g. literature references or names of software packages. Especially in the case of software packages, this can pose a considerable security risk: attackers can publish packages under such hallucinated names, filled with harmful code, which you then unsuspectingly run on your own computer (Briatte, 2025; Mammides & Papadopoulos, 2024; Wu et al., 2024).

Heavy reliance on AI tools may also prevent you from learning crucial research skills, such as formulating your own ideas, deeply understanding your topic, mastering various analytical approaches, critical reasoning, etc. (Johnson et al., 2024; Millard et al., 2024). The usage of AI tools also raises equity questions, especially with regard to paid versions. Naturally, LLMs will be more important for non-native English speakers, but access to higher-level (and often paid) tools is not available to all, especially to people from low-income countries (Campbell et al., 2024; Wu et al., 2024).

The flip side of AI tools also includes their environmental impact. Training and tuning a single large language model can emit as much CO2 as several cars over their entire lifetimes (Strubell et al., 2019), and using LLM chatbots for one year generates about 25 times more CO2 emissions than training the GPT-3 model did (Chien et al., 2023). This does not even consider the infrastructure and equipment needed, which require a lot of water and mining of rare elements, cause contamination, and more (Chien et al., 2023; Cooper et al., 2024).

Principles of LLMs

Many people talk about LLMs as if they were real people. We all know, or perhaps even use, phrases like “ChatGPT told me…” or “He thinks that…”. Some people are used to having long conversations with them, and some even use them as psychotherapists (e.g., Lau et al., 2025). However, the principles of how LLMs work are far from what we would consider “thinking”.

The basic principle of LLMs is predicting the next word, or more precisely, the next “token”, based on the sequence of the previous ones. Each token (typically a word or part of a word) is first transformed into an array of numbers (an embedding). The model then assigns a probability to each possible next token, based on the patterns it has learned from its training data. A token is selected according to these probabilities, added to the sequence, and the process starts again. Interestingly, the most probable token is not always selected; instead, the model often samples from among several highly probable options to introduce some variability. This way, the model produces what looks like a “reasonable continuation” of a text (Stanford CS324, 2022; Wolfram, 2023).
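To make the sampling step more concrete, here is a minimal R sketch of choosing the next token from a toy probability distribution. The candidate tokens, their probabilities, and the temperature value are all invented for this illustration; in a real LLM, the probabilities come from a large neural network.

# Toy illustration of next-token sampling (not a real LLM).
candidates <- c("mat", "chair", "roof", "dog")   # possible next tokens
probs      <- c(0.55, 0.25, 0.15, 0.05)          # invented probabilities

# Greedy choice: always pick the single most probable token.
candidates[which.max(probs)]

# Sampled choice: draw from the distribution, so repeated runs can differ.
set.seed(42)  # only to make this example reproducible
sample(candidates, size = 1, prob = probs)

# A "temperature" parameter reshapes the distribution before sampling:
# values below 1 sharpen it (more deterministic), values above 1 flatten it.
temperature <- 0.7
reweighted  <- probs^(1 / temperature)
reweighted  <- reweighted / sum(reweighted)
sample(candidates, size = 1, prob = reweighted)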

And this is where it gets complicated. The “reasonable continuation” means “what a human would expect the text to look like”. It is derived from billions of websites, articles, and books, which were at some point downloaded from the internet and broken into tokens. The model learns co-occurrence patterns, context, and sequences of tokens, which are then turned into probabilities (Stanford CS324, 2022; Wolfram, 2023). To put things in perspective, English has roughly 40,000 commonly used words. Yet even billions of web pages are not enough to estimate the probabilities of all possible token sequences directly from counts, because the number of possible sequences grows explosively with their length. This is where large neural networks, machine learning, and optimization algorithms step in: they generalize from the patterns seen in the data instead of relying on raw counts (Wolfram, 2023).
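The counting idea itself is simple enough to demonstrate. The following R sketch builds a toy bigram model (next-word probabilities estimated from co-occurrence counts) on a three-sentence corpus invented for this example. Even here, most of the count table is zero, which hints at why raw counts stop working for larger vocabularies and longer sequences.

# Toy bigram "language model": next-word probabilities estimated from
# co-occurrence counts in a tiny invented corpus.
corpus <- c("the cat sat on the mat",
            "the dog sat on the rug",
            "the cat chased the dog")

# Collect pairs of consecutive words within each sentence.
bigrams <- do.call(rbind, lapply(strsplit(corpus, " "), function(w) {
  data.frame(current = head(w, -1), nxt = tail(w, -1))
}))

# Count how often each word follows each other word.
counts <- table(bigrams$current, bigrams$nxt)
counts  # already mostly zeros, even for this tiny vocabulary

# Estimated probabilities of the next word, given the word "the".
counts["the", ] / sum(counts["the", ])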

You might have heard about general linear models, which are often used in ecology. Such models typically include a few parameters, and adding new terms (and parameters) can soon make the model quite demanding in terms of computational capacity and dataset size. The most advanced version of the GPT-3 model (no longer in use, but the most advanced model for which the number of parameters and other details are publicly available) has 175 billion parameters, its size is about 350 GB, and it requires roughly 300-700 GB of RAM to run (Brown et al., 2020), which is far beyond what a standard personal computer can currently handle.
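The reported model size is easy to sanity-check with back-of-the-envelope arithmetic in R; the assumption that each parameter is stored in 16-bit (2-byte) precision is ours, not taken from the cited paper.

# Rough sanity check of the GPT-3 size cited above
# (assumption: 16-bit precision, i.e. 2 bytes per parameter).
n_params        <- 175e9   # 175 billion parameters
bytes_per_param <- 2
n_params * bytes_per_param / 1e9   # ~350 GB, matching the reported size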

We consulted ChatGPT on this part to check the description of terms and processes, many of which we had never heard of before and only vaguely understand.

Tips for using LLMs

Here, we put together a list of (hopefully) useful tips for using LLMs (especially LLM-based chatbots, such as ChatGPT or Gemini):

Different tools are useful for different tasks

ChatGPT: Summaries, explanations, research, presentations, coding.

Google Gemini: Helping with code, writing and research.

Microsoft Copilot: Writing, researching, coding, brainstorming, summarizing.

Le Chat Mistral: Quick research, writing, brainstorming, summarizing, learning.

Perplexity: Information retrieval, text generation, problem-solving, language understanding, multitask support, creativity.

Claude.ai: Writing, coding, analysis, research, problem-solving.

These answers were generated by each AI tool. The prompt was: “Can you tell me what are your strengths? Like for what tasks people use you the most? Give me a 5-6 word answer.”

Other AI tools without a chat-like interface:

DeepL: translator, AI-powered text editing

QuillBot: translator, grammar-checker, paraphraser

Grammarly: text editing

Elicit: literature search

Gamma: presentations

When you want to compare several AI tools:

LMArena: side-by-side comparison of different LLMs, up-to-date lists of the best-performing models in various tasks


Practical tips for data analysis and coding

  • Prior knowledge of coding and statistics is necessary to write a sufficiently detailed prompt and to use and interpret AI-generated code correctly (Campbell et al., 2024)

  • Specify the coding language and the packages you want to use (e.g., tidyverse, vegan) (Cooper et al., 2024; Vieira & Raymond, 2025)

  • If you analyse data with an LLM, always ask it to generate an R script and test that the script works. Do not just take over the results and export the figures. This is essential for reproducibility and further adjustments.

  • Use full sentences, with as much context as possible (Cooper et al., 2024)

  • If you are trying to resolve an error, copy not only the error message but also the whole script the error comes from. This gives the LLM the context of the functions and libraries you used (Ellen, 2025)

  • Effective prompts include context, specify the topic, outline the desired output, and ask a concise, focused question (Lubiana et al., 2023)

  • When the conversation leads nowhere, start a new chat. Try to give the LLM a different context (Willison, 2025b)

  • An LLM does not always change only the part of the script you ask it to; it may also change other parts without warning (Çetinkaya-Rundel, 2025)

  • Always check the script before you apply it (Willison, 2025a). Remove unnecessary parts from the suggested script and run it line by line to understand what is happening (Çetinkaya-Rundel, 2025)

  • From time to time, also consult traditional search engines or Stack Overflow, because LLMs can lack information about the latest updates. You may also find better (or different) approaches to certain tasks there (Ellen, 2025; Willison, 2025b)

  • Questions to ask before you start (Davjekar, 2024):

    • “I’m starting a new [type of project] using [programming language/framework]. Can you suggest a basic file structure and essential dependencies I should consider?”

    • “I want to build [brief project description]. Can you help me break this down into smaller tasks and suggest an order of implementation?”

  • Questions to ask about the code (RStudioDataLab, 2023):

    • Why did you generate this code?

    • What does this code do?

    • How can I fix this error?

    • What are the alternatives to this code?

    • How can I improve this code?

  • Handy prompts (Lubiana et al., 2023):

    • “Add explanatory comments to this code:”

    • “Rename the variables for clarity:”

    • “Write me a standard GitHub README file for the above code.”

    • “Extract functions for increased clarity:”

    • “Re-write and optimize this for-loop:” 

    •  “Write me regex for R/Python/Excel with a pattern that will extract {} from {}”

    • “Create a ggplot2 violin plot with a log10 Y axis” (a sketch of a possible answer is shown below)
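For illustration, the last prompt above might return something like the following minimal sketch. The choice of the built-in iris dataset is ours, since the prompt does not specify any data.

# A possible answer to the violin-plot prompt above,
# shown on the built-in iris dataset (illustration only).
library(ggplot2)

ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  scale_y_log10()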

References

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Campbell, H., Bluck, T., Curry, E., Harris, D., Pike, B., & Wright, B. (2024). Should we still teach or learn coding? A postgraduate student perspective on the use of large language models for coding in ecology and evolution. Methods in Ecology and Evolution, 15(10), 1767–1770. https://doi.org/10.1111/2041-210X.14396
Çetinkaya-Rundel, M. (2025). Learning the tidyverse with the help of AI tools. https://www.tidyverse.org/blog/2025/04/learn-tidyverse-ai/
Chien, A. A., Lin, L., Nguyen, H., Rao, V., Sharma, T., & Wijayawardana, R. (2023). Reducing the carbon impact of generative AI inference (today and in 2035). Proceedings of the 2nd Workshop on Sustainable Computer Systems, 1–7. https://doi.org/10.1145/3604930.3605705
Cooper, N., Clark, A. T., Lecomte, N., Qiao, H., & Ellison, A. M. (2024). Harnessing large language models for coding, teaching and inclusion to empower research in ecology and evolution. Methods in Ecology and Evolution, 15(10), 1757–1763. https://doi.org/10.1111/2041-210X.14325
Davjekar, A. (2024). AI-assisted software development: A comprehensive guide with practical prompts (Part 1/3). https://aalapdavjekar.medium.com/ai-assisted-software-development-a-comprehensive-guide-with-practical-prompts-part-1-3-989a529908e0
Ellen, L. (2025). How would I learn to code with ChatGPT if I had to start again. In Towards Data Science. https://towardsdatascience.com/how-would-i-learn-to-code-with-chatgpt-if-i-had-to-start-again/
Briatte, F. (2025). AI-generated code comes with security risks. R-bloggers. https://www.r-bloggers.com/2025/04/ai-generated-code-comes-with-security-risks/
Johnson, T. F., Simmons, B. I., Millard, J., Strydom, T., Danet, A., Sweeny, A. R., & Evans, L. C. (2024). Pressure to publish introduces large-language model risks. Methods in Ecology and Evolution, 15(10), 1771–1773. https://doi.org/10.1111/2041-210X.14397
Kokotajlo, D., Alexander, S., Larsen, Th., Lifland, E., & Dean, R. (2025). AI 2027. https://ai-2027.com/race#narrative-2025-08-31
Lau, Y., Ang, W. H. D., Ang, W. W., Pang, P. C.-I., Wong, S. H., & Chan, K. S. (2025). Artificial intelligence-based psychotherapeutic intervention on psychological outcomes: A meta-analysis and meta-regression. Depression and Anxiety, 2025(1), 8930012. https://doi.org/10.1155/da/8930012
Lubiana, T., Lopes, R., Medeiros, P., Silva, J. C., Goncalves, A. N. A., Maracaja-Coutinho, V., & Nakaya, H. I. (2023). Ten quick tips for harnessing the power of ChatGPT in computational biology. PLOS Computational Biology, 19(8), e1011319. https://doi.org/10.1371/journal.pcbi.1011319
Mammides, C., & Papadopoulos, H. (2024). The role of large language models in interdisciplinary research: Opportunities, challenges and ways forward. Methods in Ecology and Evolution, 15(10), 1774–1776. https://doi.org/10.1111/2041-210X.14398
Millard, J., Christie, A. P., Dicks, L. V., Isip, J. E., Johnson, T. F., Skinner, G., & Spake, R. (2024). ChatGPT is likely reducing opportunity for support, friendship and learned kindness in research. Methods in Ecology and Evolution, 15(10), 1764–1766. https://doi.org/10.1111/2041-210X.14395
RStudioDataLab. (2023). How to use ChatGPT for data analysis in R. In Medium. https://rstudiodatalab.medium.com/how-to-use-chatgpt-for-data-analysis-in-r-891372af842
Stanford CS324. (2022). Introduction. In CS324. https://stanford-cs324.github.io/winter2022/lectures/introduction/
Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1355
The group for AI in teaching at Masaryk University. (2023). Statement on the application of artificial intelligence in teaching at Masaryk University. In Masaryk University. https://www.muni.cz/en/about-us/official-notice-board/statement-on-the-application-of-ai
Vieira, V., & Raymond, E. S. (2025). AI code guide. https://github.com/automata/aicodeguide
Willison, S. (2025a). Hallucinations in code are the least dangerous form of LLM mistakes. In Simon Willison’s Weblog. https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
Willison, S. (2025b). Here’s how I use LLMs to help me write code. In Simon Willison’s Weblog. https://simonwillison.net/2025/Mar/11/using-llms-for-code/
Wolfram, S. (2023). What is ChatGPT doing … and why does it work? Stephen Wolfram Writings. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
Wu, X., Duan, R., & Ni, J. (2024). Unveiling security, privacy, and ethical concerns of ChatGPT. Journal of Information and Intelligence, 2(2), 102–115. https://doi.org/10.1016/j.jiixd.2023.10.007