Using ChatGPT4 for data analysis
Can you use ChatGPT4 and custom GPTs for investigative journalism, for example for data analysis? Most definitely yes (but be careful). Here is what I learned converting 39 JPEGs into an all-text Excel file.
TL;DR version:
- ChatGPT handles .xlsx files (Excel) significantly better than CSV files.
- It handles transcribing text from images well, with minimal errors (it even identified mistakes in the original poster’s translation).
- It is more adept at transcribing, sorting, and handling data than at analysing it. Use it for structuring data, and double-check its results during the analysis phase.
- For creating spreadsheet files, ChatGPT4 performs better than other tools such as Claude (3 Sonnet), Gemini, and Copilot, mainly because it is better at creating a file, so you don’t have to copy-paste data from the LLM into the file.
- I tasked it with transcribing canteen menus and identifying meals containing pork. It did an okay job when I asked it to look at the data as a data analyst, and a bit better when I told it to look at the menus as a chef and gave it examples of food containing pork (e.g. bacon, ham). Even so, it kept telling me that fish fritters (da: fiskefrikadeller) were pork, because it translated fiskefrikadeller to fish meatballs and reasoned that meatballs = pork.
- Utilising a custom GPT for the analysis simplifies the process, though errors such as forgetting custom instructions can occur.
- If you ask it to correct errors in the dataset, it quite often does a good job.
Was it a more straightforward, faster, and superior method compared to manual analysis without AI? Yes, particularly after realising its limitations with CSV files.
Project background
I am continually seeking data projects to explore the capabilities and limitations of AI.
This seemed like a fun task: every week the SDU main canteen posts a JPEG on Facebook with the menu for the following week. My colleague Freja and I have tried to engage our students in analysing and scrutinising the menu (does it fit the strategy for climate and diversity, for example?), but none of them seemed that interested, so… now I’ve done part of it myself. First a disclaimer: I only looked at the main canteen; there is an all-green canteen as well. So this is only part of the picture.
I decided to test if I could get ChatGPT4 to help me do it. And the answer is… kind of. So, if you want to know more about using a GPT for data analysis, come along for the ride:
Analysing JPEGs and making a CSV-file
The canteen publishes its menus on Facebook as JPEGs. Not very data friendly. So, I manually downloaded the JPEGs (boring, but easily done) and ended up with 39 weeks, from week 18 of 2023 to week 17 of 2024. Then I asked ChatGPT4 to transcribe the menus. It did an impressively accurate job.
The menu looked like this:
When transcribed, it looked like this:
I then asked it to organise the transcribed data into a CSV format. I requested five columns: week number, day, Danish text, English text, and allergens.
Very structured and nice. It managed to handle days without text (in some weeks the canteen didn’t translate the menus) and allergen numbers that weren’t written in exactly the same way every week, or weren’t there at all. And when I asked it to make a downloadable CSV file, it did. All in all, quite impressive.
However, repeating the same task became monotonous, exacerbated by frequent errors, particularly when processing multiple JPEGs simultaneously.
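For reference, the five-column layout described above can be sketched in a few lines of Python. This is only an illustration of the target structure, not what ChatGPT ran internally; the header names, the function name `menu_rows_to_csv` and the example rows (including the allergen numbers) are made up for the sketch.

```python
import csv
import io

# Column layout taken from the text; the header names themselves are assumed.
COLUMNS = ["week", "day", "danish_text", "english_text", "allergens"]


def menu_rows_to_csv(rows):
    """Return CSV text for a list of dicts; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example rows, not taken from the real menus.
example_rows = [
    {"week": 18, "day": "Mandag",
     "danish_text": "Fiskefrikadeller med kartofler",
     "english_text": "Fish fritters with potatoes",
     "allergens": "1, 3, 4"},
    # A day where the canteen gave no translation and no allergen numbers.
    {"week": 18, "day": "Tirsdag", "danish_text": "Butter chicken med ris"},
]

csv_text = menu_rows_to_csv(example_rows)
print(csv_text)
```

Leaving missing translations and allergens as empty cells, rather than guessing them, keeps it clear afterwards which gaps come from the canteen and which from the transcription.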
Developing a Custom GPT in ChatGPT4
Then I considered: can I create a GPT (a ChatGPT robot with predefined instructions) and instruct it to transcribe every JPEG, leave sections blank if no data is available, move the allergens to the last column (even if they are not written correctly), and compile every week into the same CSV file?
The outcome was a partial success. The main challenges were that the GPT occasionally forgot earlier instructions and that it could not add data to an existing file, though it managed to create a new file satisfactorily.
Errors I encountered:
- When it got stressed, it made an illustration (!) of the menu instead of a CSV file – then I turned Dall-E off in its instructions. That helped.
- It got confused because it had itself translated the headers to English, and I’d given it a file with Danish headers matching the ones on the menus. But it taught itself to correct this…
- When I tasked it with processing four JPEGs simultaneously, it disrupted the data, leaving out some days and misordering others. This underscores the necessity of human oversight to ensure accuracy. Consequently, I reverted to processing one week at a time. Upon requesting it to generate a new CSV with the corrected sequence, it successfully complied.
- I encountered numerous errors where it seemed to forget my instructions repeatedly. Each retry meant waiting while the AI was processing (or ‘thinking’), so it is worth doing something else in the meantime instead of watching the screen. Switching to .xlsx made the process significantly faster than it was with a CSV file.
- Initially, it claimed it could not create a downloadable CSV file, although it had successfully done so previously. Therefore, I deduced that a GPT could indeed accomplish this task. After revisiting and revising the instructions to affirm its capability, the system acquiesced and executed the command.
- Subsequently, I reached the data limit, necessitating a pause of an hour or two before proceeding.
- And then, when I came back, it had forgotten how to do what it had been doing for hours.
I had completed 19 weeks by then, but when I switched to .xlsx format, I ended up processing each week again, simply because it felt so satisfying to see how smoothly it consolidated them all into the same .xlsx file.
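The skipped and misordered days from the error list above are the kind of problem a short script can catch before a human reads the file. Below is a minimal sketch, assuming the menu runs Monday to Friday with Danish day names (an assumption, the weekly layout is not spelled out here); the function name `check_weeks` and the sample data are hypothetical.

```python
from collections import defaultdict

# Assumption: the menu covers Monday to Friday, written in Danish.
EXPECTED_DAYS = ["Mandag", "Tirsdag", "Onsdag", "Torsdag", "Fredag"]


def check_weeks(rows):
    """Return {week: problem} for weeks with missing or misordered days.

    `rows` is a list of (week, day) tuples in the order they appear in the file.
    """
    days_by_week = defaultdict(list)
    for week, day in rows:
        days_by_week[week].append(day)

    problems = {}
    for week, days in days_by_week.items():
        missing = [d for d in EXPECTED_DAYS if d not in days]
        if missing:
            problems[week] = "missing: " + ", ".join(missing)
        elif days != EXPECTED_DAYS:
            problems[week] = "unexpected order or duplicates"
    return problems


# Made-up sample: week 18 is fine, week 19 is misordered, week 20 lacks two days.
sample = (
    [(18, d) for d in EXPECTED_DAYS]
    + [(19, "Mandag"), (19, "Onsdag"), (19, "Tirsdag"),
       (19, "Torsdag"), (19, "Fredag")]
    + [(20, "Mandag"), (20, "Tirsdag"), (20, "Fredag")]
)
problems = check_weeks(sample)
print(problems)
```

A check like this only flags structural gaps. A week with a public holiday would also be reported, so the output is a list of places to look, not a list of confirmed errors.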
XLSX Instead of CSV
Initially hesitant to switch file types, I ultimately found the transition from .csv to .xlsx not only manageable but significantly beneficial.
This is how the errors looked for the CSV file, over and over (it’s in Danish; I mainly wrote everything in English, but at some point I got tired and annoyed and switched to Danish. I’m sure you catch the drift, even if you don’t know the language).
Then I modified the GPT, adding a .xlsx file to its instructions and converting the format from .csv to .xlsx. This adjustment was straightforward and made a significant difference.
I encountered only 2-3 errors throughout the entire 39 weeks. I still had to add them one week at a time, writing “Now do week 19” and uploading the JPEG. Manageable, but a bit boring.
By the end it looked like this:
Now for the actual analysis
I then asked ChatGPT4 to function as a data analyst to identify patterns and propose potential stories for the university newspaper. While it performed adequately, the insights were not beyond what could be achieved through a basic brainstorming session.
Then I asked it to highlight every meal containing pork in pink text:
It wasn’t impressive.
Apparently butter chicken is pork. While pulled pork isn’t.
When I changed my prompt to ‘act like a chef and look for pork (e.g. bacon, ham)’, it went a bit better. I still had to read the result myself and correct a few errors (e.g. it translated fiskefrikadeller to fish meatballs instead of fish fritters and then stated that meatballs equal pork).
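A plain keyword check on the Danish text is one way to cross-check the pork labels. The sketch below is not the method used in this project; the term list, the exception list and the function name `contains_pork` are assumptions for illustration, and treating plain frikadeller as pork is itself a guess, since recipes vary.

```python
# Danish terms that usually signal pork. The list is an assumption made for
# illustration; extend it after reading the real menus.
PORK_TERMS = ("svin", "skinke", "flæsk", "bacon", "pork", "frikadelle")

# Dishes that contain a pork term but are not pork (assumed list).
NOT_PORK = ("fiskefrikadelle",)


def contains_pork(danish_dish: str) -> bool:
    """Return True if the Danish dish name contains a pork term."""
    text = danish_dish.lower()
    # Blank out known exceptions first, so "fiskefrikadeller" cannot
    # match the shorter term "frikadelle".
    for exception in NOT_PORK:
        text = text.replace(exception, " ")
    return any(term in text for term in PORK_TERMS)


# Made-up dish names based on the examples in the text.
for dish in ["Fiskefrikadeller med remoulade", "Butter chicken med ris",
             "Pulled pork burger", "Frikadeller med kartofler"]:
    print(dish, "->", contains_pork(dish))
```

Matching on the Danish column avoids the translation step that produced the fish meatballs error. The English word "ham" is deliberately left out of the list, because substring matching would also hit "champignon".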
But all in all:
- Transcribing the JPEGs manually would have been considerably more time-consuming. While focusing solely on counting pork meals might have allowed for quicker completion, this approach would not have provided the complete dataset.
- Creating a GPT for the .xlsx format streamlined the process significantly, 10/10 would recommend!
So if any of my students want to do the actual journalism, they can download the file here (if allergens or an English translation are missing, they weren’t provided by the canteen).
Comparing Other AI Tools
I also tested Claude (3 Sonnet), Copilot, and Gemini. My testing was not exhaustive, but initial impressions suggest:
- Claude effectively provided the data but could not generate a spreadsheet, so you would have to copy/paste all of the data.
- Copilot initially refused to perform the task but has since proven capable, though manual data transfer remains necessary as well.
- Gemini executed the task smoothly, even creating a Google Sheet with the data. However, it struggled to add additional data to an existing sheet (I’m sure they’re going to fix that soon!).
All in all, AI can be somewhat buggy regardless of the bot you choose, requiring patience and creativity to devise workarounds when issues arise. However, when used effectively, it can significantly simplify sorting data.