Using text matching software to detect and deter plagiarism

Most instructors probably have to deal with plagiarism at some point, and all have to be prepared to, especially nowadays, because technological advances have made it easy for students to copy the work of others. But technological advances have also made it easier to detect potential cases of plagiarism. I’ve been doing a bit of background research on text matching software and its use in detecting and deterring plagiarism. I wanted to share what I’ve found so far, and invite you to share your own experiences with such software.

What’s text matching software?

Text matching software compares electronically submitted assignments against a database of documents. Depending on the software, this database might include some or all of: other assignments, webpages, and other documents like scientific journal articles. Note that I said “text matching” and not “plagiarism detection”, since no software can distinguish between, e.g., properly cited quotations and plagiarism. Some packages can be customized in various ways, for instance by letting you choose the parameters governing the behavior of the text matching algorithm.
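
To make the “text matching, not plagiarism detection” point concrete, here is a minimal sketch of one simple matching approach: counting runs of consecutive words that a submission shares with a source document. This is an illustration only, not the algorithm of any package discussed in this post; the function names, the default phrase length of 5 words, and the example sentences are all assumptions made up for the sketch.

```python
import re

def word_ngrams(text, n=5):
    """Return the set of all runs of n consecutive words in text.

    Words are lowercased and punctuation is dropped, so matching
    ignores capitalization and quotation marks.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def matched_fraction(submission, source, n=5):
    """Fraction of the submission's n-word runs that also occur in the source."""
    sub = word_ngrams(submission, n)
    if not sub:
        return 0.0
    return len(sub & word_ngrams(source, n)) / len(sub)

# Hypothetical example texts (invented for this sketch).
source = ("Competition between species is strongest when they use "
          "the same limiting resource.")
copied = ("As we know, competition between species is strongest when "
          "they use the same limiting resource.")
quoted = ('Smith (2010) wrote that "competition between species is strongest '
          'when they use the same limiting resource."')
original = "Two species that need the same scarce nutrient will compete intensely."

for label, text in [("copied", copied), ("quoted", quoted), ("original", original)]:
    print(label, round(matched_fraction(text, source), 2))
```

In this sketch the uncredited copy and the properly attributed quotation both match the source heavily, while the independently written sentence doesn’t match at all. The matching step alone can’t tell the first two apart, which is why a person still has to look at whatever gets flagged.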

What text matching software is available?

Some commonly used text matching software packages (not an exhaustive list):

  • Turnitin. Proprietary, costs money. Compares submitted papers to a large (many billions of items) database of webpages and other documents, including all assignments previously submitted for checking.
  • SafeAssign. Proprietary, costs money. Offered by course management software company Blackboard. Compares submitted papers to a large database of webpages, the ProQuest ABI/Inform database of scholarly papers, and assignments previously submitted for checking.
  • Wcopyfind. Free, open source software from Louis Bloomfield, a physicist at the University of Virginia. Just runs on your hard drive. Compares submitted papers to one another, and to any URLs you specify. You can adjust all of the (many) parameters controlling the matching algorithm. It’s been around for several years, is very fast, and is frequently updated. Bloomfield was prompted to write it when he encountered high levels of plagiarism in his intro physics course.
  • Viper. Proprietary, free. Says it compares submitted papers to a large database of documents, which doesn’t seem to include documents previously submitted for checking.
  • eTBLAST. Free web-based service from the Virginia Bioinformatics Institute. Compares a submitted chunk of text to your choice of one of several mostly-biomedical databases (e.g., MEDLINE), plus a few publicly-accessible websites (e.g., ArXiv, Wikipedia).
  • MOSS (Measure of Software Similarity). Free web-based service for detecting similarities among submitted computer programs. Compares the submitted programs to one another, not to a database (that’s my understanding, anyway). Registration required. Developed by a Stanford prof in 1994, continually updated since. Works with many different programming languages, though not R. Here is a list of other computer program matching tools (not sure if it’s up to date).
  • Google and other general-purpose search engines. Plugging chunks of text into a search engine sometimes will identify webpages or online documents containing the same or similar text.
  • Here’s a list of various other text matching tools. Not sure if it’s up to date.

I’m an instructor. What issues should I be aware of if I’m planning to use text matching software?

  • Does your department or university have a policy on the use of such software? Universities that have a site license for Turnitin or some other proprietary system generally have policies on its use. Google “text matching software policy”, without the quotes, to find various examples. Obviously, if your university has a policy, you should follow it. Universities that don’t have site licenses for a specific software package usually don’t have any policies on text matching software. Which doesn’t mean you should just feel free to use whatever software package you want however you want. I suggest at least asking the advice of colleagues, and probably an administrator responsible for dealing with student academic misconduct. You need to make sure that whatever you do is consistent with the university’s existing policies and procedures regarding academic misconduct.
  • Different software packages do different things, which affects their appropriate use. For instance, software like Turnitin uploads submitted documents to a database that’s owned by the software company, and that might be located in a different jurisdiction from yours, which may raise issues of privacy and intellectual property. That’s why universities that use such software often have policies allowing students to opt out in favor of some alternative means of demonstrating that their work wasn’t plagiarized (and not such an onerous alternative that it effectively forces the students to “consent” to use of the software). In contrast, if you use Wcopyfind or MOSS to compare submitted assignments to one another, that doesn’t raise any privacy or intellectual property issues that I can see. All you’re doing in that case is speeding up comparisons that you could in principle do by hand.
  • Text matching software isn’t a substitute for making sure students know what constitutes plagiarism, why it’s wrong, and that you take it seriously.
  • You need to make sure you know how to use the software and interpret its output (which usually is very easy, from what I understand).
  • It’s a bad idea to just blindly rely on any software package rather than using it to flag potential cases of plagiarism for your inspection.
  • Text matching software can struggle with paraphrased material. Though from what I’ve heard anecdotally, students who plagiarize their assignments rarely go to the trouble of paraphrasing the copied material so extensively that the text matching software can’t detect the copying. If you can customize the search parameters, as with Wcopyfind, you can set them so as to maximize your chances of detecting paraphrased material, perhaps at some cost to speed.
  • Text matching software won’t detect if a student had someone else do their assignment for them. (As an aside, it’s my anecdotal impression that essay writing services produce poor essays, but the students purchasing them mostly don’t care because their only goal is to pass the class, not to get a high mark.)
  • In order to decide which software to use, you might want to think about how students are most likely to plagiarize in the courses you teach. If they’re most likely to copy one another, then arguably there’s not much value in paying for something like Turnitin, as compared to just using something like Wcopyfind.
  • Do you plan to use the software as a deterrent? For instance, by announcing to the class that you’ll be using text matching software, or even announcing what software you’ll be using and how? Note that it’s not clear that text matching software is effective as a deterrent. Rigorous independent studies seem to be scarce, and I haven’t found any that report big drops in plagiarism following adoption of text matching software. That could be for various reasons. Anecdotally, many cases of plagiarism (quite possibly the majority) are from panicked students, or students who aren’t clear on what constitutes plagiarism. Others are from students who figure they have little to lose because they think (possibly incorrectly) that the penalty for plagiarism is fairly minor (say, just a zero on the assignment, with no indication of misconduct on their transcript). None of those categories of students will be deterred by text matching software. And unless the software is used university-wide (and I mean actually used, not just that there’s an option for instructors to use it), students who plan to plagiarize may just drop classes that use text matching software in favor of classes that don’t. Don’t laugh–the University of Alberta surveyed their students on academic integrity matters a few years ago and students reported that this is common (though of course, whether students answer such surveys honestly is a good question).
  • Do you plan to routinely check all student assignments? Routinely checking all assignments arguably is fair–there’s no risk that anyone will feel themselves to be singled out. It minimizes the odds that you’ll miss any cases of plagiarism. And it maximizes the deterrent if you’re using the software as a deterrent. But it might make all students feel like they’re under suspicion, and that they’re at risk of being charged with plagiarism even if just a few stray words coincidentally match some other document. And if your university has an honor code, routine use of text matching software risks undermining the honor code, because what’s the point of an honor code if you’re not going to trust the students to obey it? (On the other hand, if the honor code is already being widely violated, then arguably there’s nothing left for text matching software to undermine.) One way to deal with this is to emphasize to the students that routine use of text matching software is to protect the large majority of students who are honest, and keep the rare dishonest students from getting a leg up on the many honest ones.
  • Alternatively, if you’re not going to routinely check all students’ assignments, how will you decide which assignments to check? A random sample? Only if there are some other grounds for suspicion? And if so, what grounds?
  • What are the pluses and minuses of other ways of achieving the same goals? For instance, one way to minimize plagiarism is to write new assignments every time you teach the course. Unfortunately, that’s time-consuming, and doesn’t address the common problem of students in the same course copying from one another. Another approach is to only mark the students on exams and other assignments that they complete in class with you or another observer present. But that rules out many pedagogically-valuable assignments. Or you could go exclusively with project-type assignments that can’t easily be plagiarized, like having students give in-class presentations. But that’s not feasible except in small classes.
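
Two of the points above, comparing submitted assignments to one another and adjusting the matching parameters to catch lightly paraphrased copying, can be sketched in a few lines. This is a hedged illustration, not how Wcopyfind or MOSS actually work: the function names, the `length` and `threshold` parameters, the similarity measure (shared phrases divided by all distinct phrases in the pair), and the three example submissions are all assumptions invented for the sketch.

```python
import re
from itertools import combinations

def phrases(text, length):
    """Return the set of all runs of `length` consecutive words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + length]) for i in range(len(words) - length + 1)}

def flag_pairs(submissions, length=4, threshold=0.3):
    """Compare every pair of submissions and return those worth a closer look.

    The score for a pair is (phrases in both) / (phrases in either).
    Returns (name_a, name_b, score) tuples, highest score first.
    """
    sets = {name: phrases(text, length) for name, text in submissions.items()}
    flagged = []
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        score = len(sets[a] & sets[b]) / len(union) if union else 0.0
        if score >= threshold:
            flagged.append((a, b, score))
    return sorted(flagged, key=lambda item: -item[2])

# Hypothetical submissions (invented for this sketch); student_b has
# lightly reworded student_a's sentence.
submissions = {
    "student_a": "Predators reduce prey abundance, which in turn releases "
                 "plants from grazing pressure.",
    "student_b": "Predators reduce prey abundance, which then releases "
                 "plants from heavy grazing pressure.",
    "student_c": "When wolves returned, elk numbers fell and willows "
                 "recovered along the streams.",
}

print(flag_pairs(submissions, length=4))  # longer phrases: stricter matching
print(flag_pairs(submissions, length=2))  # shorter phrases: looser matching
```

With these made-up examples, the stricter setting flags nothing, while the looser setting flags the student_a/student_b pair. The trade-off is that shorter phrases also match more often by coincidence, so a looser setting means more flagged pairs to inspect by hand.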

In the comments, please share your own experiences with text matching software (as both student and instructor) and relevant links.

15 thoughts on “Using text matching software to detect and deter plagiarism”

  1. I have used SafeAssign both in small classes (1-2 sections, I did all the grading) and as one of about 30 TAs in the machine that is Intro Bio at my university. I found it much more useful in the big class. If a student is copying from a friend who is in another TA’s section, or who took the class in another year, the graders might never see the two papers together to notice that they are the same assignment. I never found an assignment plagiarized from the internet purely by using SafeAssign – there were always other ‘tells’ that I could use to then try to find the source. Though, of course, who knows what I missed.

    • “I found it much more useful in the big class. If a student is copying from a friend who is in another TA’s section, or who took the class in another year, the graders might never see the two papers together to notice that they are the same assignment.”

      Yup. I think preventing that sort of thing is probably the strongest argument for text matching software in science classes.

  2. I’m at a small liberal arts college, and we use Turnitin extensively throughout campus, but especially in the Biology department. In our majors classes, the software helps us identify about one case of plagiarism per year, and in our non-majors course this frequency is a bit higher. Our intention is to use it as an opportunity to teach students about proper use of references and paraphrasing rather than plagiarizing or quoting, which is what they’re used to from high school. Early in the semester we devote a substantial portion of one of the labs to discussing how we use Turnitin, including how we interpret similarity scores and what the consequences are for plagiarism (they’re substantial). Most of our papers require first drafts, and we often detect high levels of similarity at this round, which allows us to have the “fear of god” speech with them. That usually takes care of plagiarism from that particular student, but as you say it’s the panicked students that we usually see on the final drafts of the papers.

    One of the most common questions we get is, “when is a similarity score problematic?”, because they think that 10% similarity is ok, and clearly 100% similarity with another paper or citation is bad, but there must be a cutoff somewhere in the middle. I try to show students with example papers that even 5% can be an issue if, for example, a student copied an entire sentence word for word and didn’t credit the author (which in my book is two separate offenses). Alternatively, because Turnitin also flags citations as similar to other sources, a paper with 0% similarity is problematic – not because of plagiarism, but because the student failed to use any references in their paper.

    • Yes, if you’re going to use the software as a teaching tool, sure sounds like you’re doing it right.

      It’s a bit of a challenge because as you say, as an instructor you have to teach the students *not* to pay attention to some of the information that the software provides. In particular, the % similarity score can’t be used as a measure of whether the assignment is “ok” or not. You certainly don’t want students fiddling with their paraphrasing until they reduce the % similarity score to some magic level. Which of course is exactly what some proprietary packages (like Viper) encourage students to do.

      It’s depressing to hear that, despite everything you’re doing, and despite students knowing that the penalties for plagiarism are worse than just taking a zero on the assignment, occasional panicked students *still* plagiarize.

  3. Useful post, thank you! I find it’s usually not that hard to decide which papers to check – there is a big enough difference between the average undergrad’s writing and published work that plagiarised sections tend to stick out pretty badly (and I have twice now run into sentences that seemed familiar, only to find that they were copied from my own papers). Of course, I’m always going to miss some things that way. One of the biggest issues here at my institution is not the step of identifying plagiarism, but in the level of pain involved in the formal processes around dealing with it – it takes days and is seriously unpleasant, and as a consequence I think some academics are inclined to let low-level cases slide.

    • “One of the biggest issues here at my institution is not the step of identifying plagiarism, but in the level of pain involved in the formal processes around dealing with it – it takes days and is seriously unpleasant, and as a consequence I think some academics are inclined to let low-level cases slide.”

      In talking to colleagues, I’ve been struck by how variable the procedures for dealing with academic misconduct are among universities. At some universities, the formal process is indeed a lot of unpleasant work for the faculty. But here at Calgary, it’s actually pretty painless for the faculty, at least in my department. We just forward the evidence to the associate head of dept. for undergraduate studies, who takes it from there.

      EDIT: And I forgot to add: plagiarizing your *own instructor* should win some kind of prize. The Facepalm Trophy or something.

  4. I’m just wrapping up teaching my first university course, senior level plant ecology. I caught a student copying an entire article abstract on a practice exam and read her the riot act, only to have her copy extensively without citation the very next week on a take home exam. All I did was plug anomalous looking sentences, marked by an obvious change in style, into Google which returned the plagiarized text. This method was effective and fast, but wouldn’t catch copying off of another student’s work.

    This same student wrote me that she was puzzled because she had never encountered this problem in past classes, even though many of them had used text matching software! I don’t think that she realized how this sounded, but it did make me wonder if there are some accomplished plagiarists out there who have learned to evade software?

    • @Tom and Angela:

      Yes, we’ve also had incidents in the ecology program at Calgary in which students were caught copying lengthy passages verbatim from published scientific papers, with no quoting or reference to the source. They were given away by the obvious change in writing style.

  5. Over the past 5 years or so i have mentored at least 15 or 20 undergraduates doing research-for-credit in our lab (biochemistry) at my massive-state-institution. Most of these undergraduates are juniors or seniors targeting med school or grad school. Nearly all are straight A types. As a component of my mentoring duties, i read and edit drafts of yearly progress reports, undergrad-poster-session abstracts, and the “papers” they have to write for the variety of courses through which they receive credit for their research. This is an absolutely outstanding group of students; i believe them to be well inside the top 10% (academically) of undergraduates at the institution. And yet, in reading their writeups, one thing stands out above all else (above even their data and results, which are usually publication quality): These undergraduates cannot write worth a _shit_. I realize that academic writing is an extremely difficult thing to do, taking even innately talented writers years of work to master, and I am therefore conscious of imposing unrealistic expectations, but…

    These kids are completely unable to compose simple (to say nothing of complex) sentences. Nor are they able to organize their writing, even at the paragraph level. Sentence i will say exactly the same thing as sentence i+1 and i+3. Sentence i+2 will be the same as i+4. Indeed, much of my “editing” consists of crossing out entire pages of nonsense. And i don’t mean “nonsense” in the scientifically or factually-wrong sense. I mean nonsense as in: the grammar and construction of the sentences and paragraphs is such that the text simply cannot be parsed, even if i crank the “fuzz” setting on my English parser to 11 (most people are equipped with parsers that can only fuzz to 10). Like a CPU under load, I can feel my head heating up as my brain struggles to dissect these documents and develop remotely plausible interpretations of what the student is trying to say. I won’t even comment on higher level organizational issues. In my experience the typical undergraduate assemblage of words is more like some sort of abstract art. If i were to speak the words as written out loud to colleagues, someone would probably call 911, suspecting me of having a stroke.

    I trouble you with this rant in order to set up the following question: Is it /really/ that difficult to detect undergraduate-committed plagiarism? Like, really??? Doesn’t it stick out like uggs at a wedding? Can’t one just look for parts in the submission that look like they may have been written by someone not <= 8 years of age?

    It is possible i exaggerate. Sometimes i exaggerate. But i think you get the idea. Instead of some very time consuming and CPU-intensive bulk-text comparisons, requiring vast databases, many of which cost a lot of money, can't one just scan for the reasonable parts and google them?

    • “Is it /really/ that difficult to detect undergraduate-committed plagiarism?”

      If they’ve plagiarized a source like a scientific paper, yes, it’s not hard to tell, as other commenters have noted. Identifying the source is another matter, of course. But if they’ve plagiarized one another, you can’t tell from stylistic clues.

      The undergrads I teach also struggle with writing, though not to the extent of the students you teach from the sound of it. At my university the best students tend to be the ones who write pretty well, and the ones who write pretty well tend to be the best students.

  6. Pingback: Recommended reads #42 | Small Pond Science

  7. I teach in an information security program and thus heavily emphasize professional ethics. We are fortunate in that our institution’s course management software (Desire2Learn) fully integrates Turn-It-In into the submissions dropbox feature. All I have to do is check a box and all assignments are reviewed, with a color graph bar illustrating the level of “similarity”.

    What is most useful, we have found, is to REQUIRE students to submit a “draft” 1-2 days before the due date of a written assignment, and then review their own TII reports and revise any text that is flagged but isn’t a direct quote, reference or common-use term.

    We have found the “similarity index” as a score to be less valuable than a visual review of flagged text, as the score is a percentage of flagged text vs total number of words in the paper – thus shorter writing assignments will have higher scores than longer assignments with the same amount of flagged content. A 5% score can have a blatant plagiary (cut and paste text) while a 20% score may simply have a lot of long industry terms, references, citations etc. Again, what’s most useful is to make the student review and revise their own papers. We can then see if there’s any change in the TII score/flagged text.

    That having been said, we’re experiencing about a 5-10% SCAI issue (Student Conduct and Academic Integrity) in our classes, mainly because students get panicked at the 11th hour and just submit, hoping we won’t catch it, even though we make them go through plagiarism modules and let them know we’re using the software. I guess fear of not completing the assignment overrides intelligence at the last minute.

    We are also fortunate in that our SCAI processes are very efficient. When TII indicates an issue with a submission, I simply email our SCAI office, asking if the student has any priors, and if not, schedule a meeting with the SCAI representative and the student. If they do have priors, it automatically goes to a review panel. Our SCAI representative is experienced in handling the issues and in 15-20 minutes either the student accepts responsibility and accepts an “informal penalty” (usually a zero on the assignment and up to a 2 letter grade penalty in the class), or they refute the allegations, and it leaves my office to a convened SCAI review panel. TII evidence makes it an open and shut case.

    The review also catches between-student plagiary as all assignments are added to the database and I’ve caught many students that get assistance from friends that have already completed the course. We are also dealing with the issue of “self-plagiary”. We’ve found some courses have similar writing assignments, or students are retaking a class and resubmit content from a previous course or previous term. On our campus the university policy clearly labels that as plagiary unless the student gets instructor permission to re-use material.

    Our current problem is the use of paraphrasing software. Since students have learned about the TII reviews, they cut and paste content from outside sources into free web paraphrasing apps and then drop it into the assignment. While TII may not flag it, it does usually read like it was written by a computer. The problem is that’s currently not explicitly prohibited by policy so it’s not a SCAI issue (yet). We’re hoping it will be soon, since the student neither wrote the original content, nor wrote the paraphrase or summarization. We’re starting to prohibit direct quotes to force the students to write more themselves, and this has spawned this new problem.

    • Thanks Mike, this is very useful.

      “That having been said, we’re experiencing about a 5-10% SCAI issue (Student Conduct and Academic Integrity) in our classes, mainly because students get panicked at the 11th hour and just submit, hoping we won’t catch it, even though we make them go through plagiarism modules and let them know we’re using the software. I guess fear of not completing the assignment overrides intelligence at the last minute.”

      Yeah, that’s my experience as well–5-10% of students in a course will plagiarize no matter how much you educate them about what constitutes plagiarism, how many times you remind them that you take misconduct seriously, no matter what you tell them about the countermeasures you use, and no matter what data you give them about the number of students you’ve caught before. Presumably because they’re panicked at the last minute, or because they’re just not that invested in the course and hope to get away with it (and don’t care *that* much whether they’re caught or not).

      We are also fortunate to have pretty efficient SCAI processes that don’t place much burden on profs.

      Thanks for the tip on paraphrasing software. Wasn’t aware that existed. Will have to look into that.

  8. Pingback: As an instructor, what do you do about the fact that students are likely to share your assignments online? | Dynamic Ecology
