Thursday, July 28, 2011

On the Plagiarism of a Tach-ve-Tat Chronicle

During this period, between the 17th of Tamuz and the 9th of Av, there is an increased focus on the various historical calamities that befell the Jewish people. Jewish history is unfortunately replete with such examples. Some instances have spawned specific days of commemoration, while others have produced whole bodies of literature. And while the literature surrounding these events is diverse, covering liturgy, poetry, and history, we focus here on one type: the chronicle. Specifically, our focus is the Chmielnicki Massacres, or Gezerot Tach ve-Tat. The Hebrew refers to the dates, 1648-49, when the majority of the killing took place. While these events took place hundreds of years ago, their effects, including the total number of Jews killed, are still being debated by scholars. (See Jits van Straten, "Did Shmu'el Ben Nathan and Nathan Hanover Exaggerate: Estimates of Jewish Casualties in the Ukraine During the Cossack Revolt in 1648," Zutot 6:1 (2009), 75-82, calling into question the lower estimates of Shaul Stampfer, "What Actually Happened to the Jews of Ukraine in 1648?" Jewish History 17:2 (May 2003), 207-27.)

The best-known chronicle describing the events is R. Nathan of Hanover's Yaven Metzulah. There is an English translation of Hanover's work, Abyss of Despair, translated by R. Abraham J. Mesch. The translation includes a "traditional drawing of Maharsha."

Although the translation does not note it, this illustration, which depicts the Maharsha with long flowing hair, first appears in the Vienna, 1814 edition of the Maharsha's commentary (vol. I, vol. II). While Mesch calls this the "traditional drawing," we know of no instance earlier than the Vienna edition. Nor was this the only Vienna edition to include a questionable portrait: the 1804 Vienna edition of R. Yitzhak Alfasi's Halakhot also includes a portrait claimed to be of R. Alfasi. Again, we know of no earlier evidence that would confirm such a rendering.

A collection of these chronicles was most recently published as Gezerot Tach ve-Tat, Jerusalem, 2004. Additionally, Joel Raba, Between Remembrance and Denial, Columbia Univ. Press, 1995, discusses these chronicles, as does the collection of articles in Jewish History 17:2 (May 2003).

We turn our attention, however, to a lesser-known work from this period, Tzok ha-Itim. Tzok was in fact the first chronicle of the 1648-49 events to be published, appearing in Krakow in 1650. Indeed, some have argued that Hanover relied heavily on Tzok in compiling Yaven Metzulah (first published in 1653).

Tzok was republished in Constantinople in 1652. This edition is exceedingly rare; according to Ya'ari, only one complete copy is extant. (See Ya'ari, Kiryat Sefer 16 (1939-40).) This edition was published by R. Shmuel ben R. Shimson, who was on his way to Israel after fleeing the massacres. At the end of the book he includes a dirge (kinnah) about the events. He also penned his own introduction, which describes his suffering: "I am the only remaining survivor in my family, as the rest were killed sanctifying God's name . . . although I was spared . . . my wife and children I buried, I lost all of my possessions . . . ." He explains that "all I wanted was to dwell in the bet midrash and therefore I decided to travel to Jerusalem," and that while on his journey he came across Tzok and decided to reprint it in Constantinople "so that what has occurred shall not be forgotten." (Ya'ari, Mechkerei Sefer, Jerusalem, 1958, p. 16, reprints the entire introduction; he also provides other accounts of people who, on their way to Israel, issued works related to the 1648-49 massacres.)

Tzok was then reissued in Venice in 1656.

The first two editions list R. Meir ben Shmuel of Szczebrzeszyn as the author. The 1656 edition, however, lists a completely different author, R. Joshua ben David of Lemberg. And it is not only on the title page that a different author is listed. The work itself is not composed as a traditional narrative; instead, it is written in verse, and the first verses in all the editions spell out the author's name in an acrostic. Thus, the 1650 and 1652 editions have an acrostic that spells out R. Meir of Szczebrzeszyn's name, while the 1656 edition's acrostic spells out R. Joshua's name. In some instances words were added to create the "new" acrostic, while in others the highlighted letters were changed.

Here is the introduction to the Constantinople edition:

And here is the introduction to the Venice edition:

As an aside, it is worth noting that this is not the only time a plagiarizer has altered an acrostic to hide his stolen goods. (See Kitvei Pinchas Turburg, ed. A. R. Malachi, 24-36, for additional examples of acrostic changes; see also this earlier post discussing similar changes made to hide the identity of the true author, and this post where the plagiarizer was caught in the act and forced to admit his guilt and apologize.) Additionally, in at least one instance an acrostic was able to demonstrate authorship. In the Siddur Bet Ya'akov (although attributed to R. Y. Emden, this siddur contains numerous additions as compared to R. Emden's actual siddur, called Ammudei Shamayim - Sha'arei Shamayim; this is one of them) the Belzer Rebbe asserts that the author of the zemer Yom Shabbat Kodesh Hu had his song stolen. He came across the plagiarizer and challenged him to prove authorship. Specifically, the real author showed that his name, Yonatan, could be seen in the acrostic, and with this he vanquished the thief. R. Emden uses this story to explain the meaning of the final verse, which, loosely translated, reads: "all the talk [about authorship] should [now] end now that I have enlarged the song [and demonstrated my authorship] . . . and no one should ever steal from me as this song is my property."

An example where the acrostic has the opposite effect, obscuring the original author, is also a zemer: Yom Zeh le-Yisrael. At times this song can be confusing, depending upon which bencher one is using, because some benchers have a shorter version while others have a longer one (see here for an example). Some argue that the two versions are indicative of two authors: the original author's verses spelled out only Yitzhak (followed by lamed-vav), to which the other verses were later added, now spelling Yitzhak Luria Hazak. (Regarding this zemer, see Naftali ben Menachem, Zemirot shel Shabbat, Israel, 1949, 144-45; I. Davidson, Thesaurus of Mediaeval Hebrew Poetry, Ktav, 1970, vol. II, 348.)

Returning to Tzok: because the acrostic lends support to either author, some did not know who the "real" author was. In the 1890s, a number of these chronicles of calamitous events in Jewish history were collected and published by C. Gorlin under the title Le-Korot ha-Gezerot 'al Yisrael. Tzok is included, but instead of a traditional introduction, Gorlin prefaces it with a section asking "Who is the real author?" He argues that the real author is indeed R. Meir and not R. Joshua. This is not the first time there has been confusion regarding who is the real author and who is the thief; for another example see here, and for an example of modern-day plagiarism see here.

With regard to the Constantinople edition, Ya'ari demonstrates that it is better than the first, in that many of the typos and the like were corrected. Unfortunately, perhaps due to its rarity, the 2004 edition of Tzok relies upon the 1650 edition and not the superior 1652 one. Additionally, the 1652 edition is one of the works published by a convert; this is probably what first drew Ya'ari's interest, as he provides a bibliography of works published by converts.

It should be noted that Tzok was rather popular in its day, even if it is no longer. When R. David ha-Levi Segal, author of the Turei Zahav commentary on the Shulhan Arukh, sent a delegation to the false messiah Shabbatai Tzvi, the delegation recorded that when they entered, Shabbatai Tzvi had a copy of Tzok on the table. (See G. Scholem, Sabbatai Sevi, Princeton Univ. Press, 1976, p. 623, quoting Leib Ozer, Sippurei Ma'asei Shabbatai Tzvi, p. 81, and Sefer Tziz Nobel Tzvi, ed. I. Tishby, pp. 77-79.)


Monday, July 11, 2011

Attribution and Misattribution: On Computational Linguistics, Heresy and Journalism

by Moshe Koppel

Prof. Moshe Koppel is on the faculty of the Computer Science Department at Bar-Ilan University. He has published extensively on authorship attribution, as well as on a diverse array of topics of Jewish and scientific interest.
A few days ago, newspaper readers from New Jersey to New Zealand read about new computer software that "sheds light on the authorship of the Bible"[1]. By the time the news circled back to Israel, farteitcht and farbessert ("translated and improved," as the Yiddish quip goes), readers of Haaretz were (rather gleefully) informed that the head of the project had announced that it had been proved that the Torah was written by multiple human authors[2], just as the Bible critics had been saying all along.

I'm always skeptical about that kind of grandiose claim and this is no exception, even though the person who allegedly made the claim in this particular case happens to be me. The news reports in question refer to a recently published paper[3] in computational linguistics involving decomposition of a document into authorial components. A brief reference to application of the method to the Torah (Pentateuch) is responsible for most of the noise.

In what follows, I’ll briefly provide some background about authorship attribution research, sketch the method used in the paper, outline the main results and say a few words about what they mean. My main purpose is to explain what has actually been proved and, more crucially in this case, what has not been proved.

Authorship Attribution

One of my areas of research for over a decade has been authorship attribution, the use of automated statistical methods to identify or profile the author of a given text. For example, we can determine, with varying degrees of accuracy, the age, gender and native language of the author of a text[4]. Under certain conditions, we can determine, with varying degrees of certainty, if two texts were written by the same person[5]. Some of this work has been applied to topics of particular interest to students of Jewish texts, such as strong evidence that the collection of responsa Torah Lishmah was written by the Ben Ish Chai[6] (although he often quoted the work as if it were written by someone else) and that all of the letters in Genizat Herson are forgeries[7].

Whenever I have lectured on this topic, the first question has been: have you ever analyzed the Bible? The honest truth is that I never really understood the question and I suspect that in most cases the questioner didn't have any very well-formed question in mind, beyond the vague thought that the Bible is of mysterious provenance and ought to be amenable to some sort of statistical analysis. I would always mumble something about the question being poorly defined, Bible books being too short to permit reliable statistical analysis, etc. But, while all those excuses were quite true, I also had a vague thought of my own, which was that whatever well-formed research question I could come up with regarding Tanach, it would probably land me in hot water.

One research question that I have been working on with my graduate student, Navot Akiva, involves decomposition of a document into distinct stylistic components. For example, if a document was written by multiple authors, each of whom presumably writes in some distinct style, we'd like to be able to identify the parts written by each author. (Bear in mind this is what is known in the jargon as an unsupervised problem: we don't get known examples of each author's writing to analyze. All we have is the composite text itself, from which we need to tease apart distinctive looking chunks of text.) The object is straightforward: given a text, split it up into families of chunks in the best possible way, where by "best" we mean that the chunks that are assigned to the same family are as similar to each other as possible.

Even I could see that this could have some bearing on Tanach. So when Prof. Nachum Dershowitz, a colleague with whom I share a number of research interests, introduced me to his son, Idan, a graduate student in the Tanach program at Hebrew University, we agreed to consider how to apply this work to Tanach (sort of fudging the question of whether this meant Torah or Nach). It happens that, apart from being the most studied and revered set of books ever written, Tanach offers another advantage as an object of linguistic analysis: precisely because it has been the subject of so much study, there are many available automated tools that we could exploit in our research.

The Method

Here's how our computerized method works. Divide a text into chunks in some reasonable way. These chunks might be chapters or some fixed number of sentences or whatever; the details aren't critical and need not concern us at this stage. I'm going to call these chunks "chapters" (only because it is a less technical sounding word), but bear in mind that we are not assuming that a chapter is stylistically homogeneous; that is, the split between authors might take place in the middle of a chapter.

Our object is to split our collection of chapters into families of stylistically similar chapters. (The chapters in a family need not be contiguous.) All the chapters that look a certain way, please step to the left; all others, please step to the right.

As a first step, for any pair of chapters, we're going to have to measure the similarity between them. The trick is to measure this similarity in a way that captures style rather than content.

The way we do it is as follows: we begin by generating a list of synonym sets. For example, for the case of Tanach, we would consider synonym sets such as betoch, bekerev; begged, simla; sar, nasi; makel, mateh, shevet; and so on. There are about 200 such sets of Biblical synonyms. We generate this list automatically by identifying Hebrew roots that are translated by the same English root in the KJV. Note that not every occurrence of, for example, shevet (which can mean either “staff” or “tribe”) is a synonym for makel (which is always “staff”). We use online concordances to disambiguate, that is, to determine the intended sense of a word in a particular context. (In this respect, Tanach is especially convenient to work with.)

For every chapter and every such set of synonyms, we record which synonym (if any) that chapter uses. The similarity of a pair of chapters reflects the extent to which they make similar choices from among synonym sets. The idea is that if one chapter uses – for example – betoch, sar and mateh and the other uses bekerev, nasi and makel, the two chapters have low similarity. If a chapter doesn’t use any of the synonyms in a particular synonym set, that set plays no role in measuring the similarity between that chapter and any other chapter.
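The synonym-choice similarity just described can be sketched in a few lines of Python. The synonym sets and "chapters" below are toy stand-ins (the real list has roughly 200 sets), and the particular agreement measure is one plausible reading of "the extent to which they make similar choices"; the paper's exact metric may differ.

```python
# Toy synonym sets (transliterated); the real system uses ~200 Biblical sets.
SYNONYM_SETS = [
    {"betoch", "bekerev"},
    {"begged", "simla"},
    {"sar", "nasi"},
    {"makel", "mateh", "shevet"},
]

def synonym_choices(chapter_words):
    """For each synonym set, record which synonym (if any) the chapter uses."""
    words = set(chapter_words)
    return [frozenset(words & syn_set) for syn_set in SYNONYM_SETS]

def similarity(chapter_a, chapter_b):
    """Fraction of agreement over synonym sets in which BOTH chapters make a
    choice. Sets unused by either chapter play no role, as described above."""
    agree = total = 0
    for ca, cb in zip(synonym_choices(chapter_a), synonym_choices(chapter_b)):
        if ca and cb:        # both chapters use some member of this set
            total += 1
            if ca & cb:      # and they picked an overlapping synonym
                agree += 1
    return agree / total if total else 0.0

ch1 = "vayavo sar betoch haam uveyado makel".split()
ch2 = "bekerev haaretz amad nasi vayikach mateh".split()
ch3 = "betoch habayit yashav sar im mateh".split()

print(similarity(ch1, ch2))  # low: the chapters disagree on every shared set
print(similarity(ch1, ch3))  # higher: they agree on betoch and on sar
```

Note that the disambiguation step (deciding whether a given occurrence of shevet means "staff" or "tribe") is assumed to have happened before the words reach this code.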

Once we know the similarity between every pair of chapters, we use formal methods to create optimal families. Ideally, we want all the chapters in the same family to be very similar to each other and to be very different from the chapters in other families. In fact, such clean divisions are unusual, but the formal methods will generally find a near-optimal clustering into families. (What we call families are called “clusters” in the jargon, and the process of finding them is called “clustering”. The particular clustering method we used is a spectral approximation method called n-cut.)
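The paper uses a spectral method (n-cut) for this clustering step; as a toy stand-in that illustrates the same goal, the sketch below seeds two families with the least similar pair of chapters and then assigns every other chapter to the family whose seed it most resembles. The similarity matrix here is invented for illustration.

```python
def cluster_two_families(n_chapters, sim):
    """Greedy 2-way clustering: a toy stand-in for the spectral n-cut method
    the paper actually uses. `sim(i, j)` returns the similarity of chapters."""
    # Seed each family with the least similar pair of chapters.
    pairs = [(sim(i, j), i, j)
             for i in range(n_chapters) for j in range(i + 1, n_chapters)]
    _, seed_a, seed_b = min(pairs)
    family_a, family_b = [seed_a], [seed_b]
    # Assign every remaining chapter to the nearer seed.
    for k in range(n_chapters):
        if k in (seed_a, seed_b):
            continue
        (family_a if sim(k, seed_a) >= sim(k, seed_b) else family_b).append(k)
    return sorted(family_a), sorted(family_b)

# Invented similarity matrix: chapters 0-2 resemble one another, as do 3-5.
SIM = [
    [1.0, 0.8, 0.9, 0.1, 0.2, 0.1],
    [0.8, 1.0, 0.7, 0.2, 0.1, 0.3],
    [0.9, 0.7, 1.0, 0.1, 0.2, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.8, 0.9],
    [0.2, 0.1, 0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.2, 0.9, 0.7, 1.0],
]

fam_a, fam_b = cluster_two_families(6, lambda i, j: SIM[i][j])
print(fam_a, fam_b)  # → [0, 1, 2] [3, 4, 5]
```

A real spectral method would recover the same split here, but is far more robust when the families are not so cleanly separated.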

A key question you should ask at this point is: how many families will we get? You might imagine that the clustering method will somehow figure out the right number of families. Indeed, there are clustering methods that can do that. But – note this carefully – the number of families we obtain is not determined by the clustering method we use. Rather it is given by us as an input. That is, we decide in advance how many families we want to get and the method is forced to give us exactly what we asked for. This is a crucial point and we'll come back to it when we get to the meaning of all these results below.

In any case, at this stage, we have a tentative division of chapters into however many families we asked for. (For simplicity, let's assume that we have split the chapters into exactly two families.) This is not the final result, for the simple reason that we have no guarantee that the chapters themselves are homogeneous. The next step is to identify those chapters that are at the core of each family; these are the chapters we are most confident we have assigned correctly and are consequently the ones most likely to be homogeneous. (Note that when I say "we are confident" I don't mean anything subjective and wishy-washy; all this is done automatically according to formal criteria a bit too technical to get into here.)

Now that we have a selection of chapters that are assigned to respective families with high confidence, we use them as seeds for building a "model" that distinguishes between the two families. Very roughly speaking, we look for common words (ones not tied to any specific topic) that appear more in one family than in the other and we use formal methods (for those interested, we use SVM) to find just the right weight to give to each such word as an indicator of one family or the other. We now use this model to classify individual sentences as being in one family or the other.
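The paper trains an SVM over common-word frequencies for this step; the snippet below substitutes a much cruder scoring scheme (per-word log-ratios, Naive-Bayes style) purely to illustrate the shape of the procedure: learn word weights from the high-confidence seed chapters of each family, then score individual sentences by summing the weights. The seed data is invented.

```python
from collections import Counter
from math import log

def train_word_weights(seed_a, seed_b):
    """Learn a weight per word from the seed chapters of each family.
    Positive weight -> indicative of family A; negative -> family B.
    (A crude log-ratio stand-in for the SVM the paper actually uses.)"""
    counts_a = Counter(w for chapter in seed_a for w in chapter)
    counts_b = Counter(w for chapter in seed_b for w in chapter)
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    return {w: log((counts_a[w] + 1) / total_a) - log((counts_b[w] + 1) / total_b)
            for w in set(counts_a) | set(counts_b)}

def classify_sentence(sentence, weights):
    """Sum the word weights; the sign picks the family."""
    score = sum(weights.get(w, 0.0) for w in sentence)
    return "A" if score >= 0 else "B"

seeds_a = [["vayomer", "el", "betoch", "haam"], ["vayomer", "betoch", "sar"]]
seeds_b = [["bekerev", "haaretz", "nasi"], ["bekerev", "vayehi", "nasi"]]
weights = train_word_weights(seeds_a, seeds_b)

print(classify_sentence(["vayomer", "betoch"], weights))  # → A
print(classify_sentence(["bekerev", "nasi"], weights))    # → B
```

In practice one would restrict the vocabulary to common, topic-neutral words, exactly as the text stipulates, so that the model captures style rather than subject matter.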


Wonderful, so we did all sorts of geeky hocus-pocus. Why should you believe that this works? Maybe the whole synonym idea is wrong because we ignore subtle differences in meaning between "synonyms". Maybe the same author deliberately switches from one synonym to the other for literary reasons. Maybe we are biased because we believe something wicked and we subtly manipulated the method to obtain particular results.

These are legitimate concerns. That's why we test the method on data for which we know the right answer to see if the method gives that right answer. In this case, our test works as follows. We take two books, each of which we can assume is written by a single distinct author, mix them up in some random fashion, and check if our method correctly unmixes them. In particular, we took as our main test set random mishmashes of Yirmiyahu and Yechezkel.

We found that the method works extremely well. About 17% of the psukim could not be classified (no differentiating words appeared in these psukim or their near neighbors). Of the approximately 2200 psukim that were classified into two families, all the Yirmiyahu psukim went into one family and all the Yechezkel psukim went into the other, with a total of 26 (1.2%) exceptions. We obtained similar results on a variety of other book pairs.
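The validation protocol itself is easy to sketch: interleave verses from two known books at random, run the decomposition, and score the better of the two possible alignments between the recovered families and the true books (the families come out unlabeled, so the mirror-image assignment is equally valid). The `decompose` argument below is a placeholder for the full pipeline; it is stubbed here with a trivial verse-length rule purely so the harness runs end to end.

```python
import random

def evaluate_unmixing(book1, book2, decompose, seed=0):
    """Mix the verses of two known books, decompose the mixture, and report
    accuracy under the better of the two family-to-book alignments."""
    rng = random.Random(seed)
    verses = [(v, 0) for v in book1] + [(v, 1) for v in book2]
    rng.shuffle(verses)
    texts = [v for v, _ in verses]
    truth = [label for _, label in verses]
    guessed = decompose(texts)            # a list of 0/1 family labels
    hits = sum(g == t for g, t in zip(guessed, truth))
    # Families are unlabeled, so take the better of the two alignments.
    return max(hits, len(truth) - hits) / len(truth)

# Stub pipeline: classify by verse length -- a placeholder, not the real method.
def stub_decompose(texts):
    return [0 if len(t.split()) <= 3 else 1 for t in texts]

short_book = ["a b", "c d e", "f g", "h i j"]
long_book = ["k l m n o", "p q r s", "t u v w x", "y z a b c"]
print(evaluate_unmixing(short_book, long_book, stub_decompose))  # → 1.0
```

Substituting the real synonym/clustering/SVM pipeline for `stub_decompose`, and Yirmiyahu and Yechezkel for the toy books, gives the experiment reported above.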

So maybe we should have left well enough alone. But with a power tool like this in hand, how could you not want to see how it would split the chumash? Shoot me, but for me, like Rav Kahana hiding under his rebbe's bed, Torah hee ve-lilmod ani tzarich. We did the experiment. I should hasten to mention, though, that the chumash experiment is only briefly mentioned in the published paper, which focuses on proving the efficacy of the method (it's a computational linguistics paper, not a Bible paper).

Now, I should point out that until I got involved in this, I was a complete am haaretz in Bible Criticism, a perfectly agreeable state of affairs, as far as I was concerned. However, Idan Dershowitz immediately observed that our split was very similar to the split between what critics refer to as the Priestly (P) and non-Priestly portions of the Torah. Bear in mind that there are ongoing disagreements among the critics about precisely which psukim should be regarded as P and which not. We took two standard such splits, that of Driver and that of Friedman, and refer to the set of psukim for which they agree as “consensus” psukim. (They agree just over 90% of the time.)

Here’s the result. Our split of the Torah into two families corresponds with their split for about 90% of all consensus psukim.

Let me say a few words about the main areas of disagreement. To a significant extent, our split runs along lines of genre. One family is mostly – not completely – legal material and the other is mostly narrative. Since what the critics call the Priestly sections include pretty much all of Vayikra (which is mostly laws), as well as selected portions of Bereishis, Shemos and Bemidbar, their split also corresponds somewhat to the legal/narrative split. Most of the cases where our split differs from theirs involve narrative sections that they assign to P and our method assigns to the family that corresponds to non-P, for example, the first chapter of Bereishis. (The rest of the disagreements involve P sections that scholars now refer to as H and consider some sort of quasi-P, but I don't want to get into all that, mostly because I'm still pretty clueless about it.)

Before you dismiss all this by saying that all we did was discover that stories don’t look like laws, let me point out there are plenty of narrative sections that the computerized analysis assigned to the P family (or, more precisely, to the nameless family that turns out to be very similar to what the critics call the P family). Two prominent examples are the story of Shimon and Levi in Shechem and the story of Pinchas and Zimri.

One more point: when we split the Torah into three or more families, our results do not coincide with those of the critics. In the case of three families, Devarim does seem to split off as its own family, as the critics claim, but there are a fair number of exceptions. And even with four or more families, no hint of the critics' E/J split shows up at all.

Interpreting the Results

So does all this mean that we have proved that the Torah was written by at least two human authors, as the breathless reports claim? No.

First of all, as I noted above, our method does not determine the optimal number of families. That is, it does not make a claim regarding the number of authors. Rather, you decide in advance how many families you want and the method finds the optimal (or a near-optimal) split of the text into that number. If you ask it to split Moby Dick into two (or four or thirteen) parts, it will do so. Thus the fact that we split the Torah into two tells us exactly nothing about the actual number of authors.

Having said that, I want to temper any religious enthusiasm such a disclaimer might engender. First of all, with a few improvements to the method we could probably identify some optimal number of families for a given text. We simply haven’t done so. Second, the fact that – for the case of two families – the results of our method coincide (to some extent) with those of the critics would seem to suggest that the split the method suggests is not merely coincidental.

But, the deeper reason that our work is irrelevant to the question of divine authorship is simply that it does not – indeed, it could not – have a thing to say on that question. If you were to have some theory about what properties divine writing ought to have and close analysis revealed that a certain text probably did not have those properties, then you might have to change your prior belief about the divine provenance of that text. But does anyone really have some theory about what divine texts are supposed to look like? Several press reports about this work referenced the idea that “God could write in multiple voices”. I find that formulation a bit simplistic, but it captures the fact that any attempt to map from multiple writing styles to multiple authorship must be rooted in assumptions about human cognition and human performance that are simply not relevant to the question of divine action[8].

In short, our results seem to support some findings of higher Bible criticism regarding possible boundaries between distinct stylistic threads in the Torah. These results might have some relevance regarding literary analysis of the Torah. Taken on their own, however, they are not proof of multiple authorship. Furthermore, there is nothing in these results that should cause those of us committed to the traditional belief in divine authorship of the Torah to doubt that belief.

[3] M. Koppel, N. Akiva, I. Dershowitz and N. Dershowitz, (2011). Unsupervised Decomposition of a Document Into Authorial Components, Proceedings of ACL, pp. 1356-1364.
[4] S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), Automatically Profiling the Author of an Anonymous Text, Communications of the ACM, 52 (2): pp. 119-123 (virtual extension).
[5] M. Koppel, J. Schler and E. Bonchek-Dokow (2007), Measuring Differentiability: Unmasking Pseudonymous Authors, JMLR 8, July 2007, pp. 1261-1276.
[6] M. Koppel, D. Mughaz and N. Akiva (2006), New Methods for Attribution of Rabbinic Literature, Hebrew Linguistics: A Journal for Hebrew Descriptive, Computational and Applied Linguistics, 57, pp. 5-18.
[7] M. Koppel, "Identifying Authors with Computational Methods: 'Genizat Herson'" [Hebrew], Yeshurun 23 (Elul 5770), 559-66.
[8] I realize that this argument comes close to asserting that the claim of divine authorship is unfalsifiable, which for some might cast doubt on the meaningfulness of that claim. A proper response to that concern would involve a discussion of the nature and content of religious belief, a discussion that is well beyond the scope of this brief peroration.
