Globe and Mail: Figure skating judging system still has flaws

Serious Business · Feb 16, 2012

skatinginbc said:
Does the Technical Panel of the current system actually count the features achieved? What is shown on their computer screen? A checklist of features for them to click on the box if the feature is achieved? I am not a skater and have no clear idea how they actually do the judging. I'm inclined to believe they are judging by impression and may review the footage if they are not in agreement.

From what I can remember... based on what I've read: The tech panel does count features. From CoyoteChris's report as an audience member at the US nationals, where he got commentary from an IJS tech specialist: a panel would divvy up the work, one panelist would count up the revolutions in a spin, and another would look for edge changes. I think there's a video review if they're unsure on anything. But the panel very much works together, which wouldn't be the case in your proposed model.

Regardless, I think the basis of your idea is great. And it can be implemented in some form.

hurrah · Feb 16, 2012

I really like skatinginbc’s suggestion! His way of judging would not instate a check & balance system (which I’m not sure is a good idea in this particular scenario; I think it would more likely enable judges' personal bias to creep into the score). Rather, it would enable sovereignty between the scoring of levels and quality. And a scoring system that makes ‘gray area’ numerically visible is a great, great idea.

skatinginbc · Feb 16, 2012

One specialist counts the rotations, another looks for edge changes. It means only one person is making the decision on a particular technical aspect and therefore subjectivity plays a significant role. Is it more reliable than seven judges giving out assessments based on impression? I'm not sure. I need reliability studies to prove it. Let's think about this question: Which is more accurate: (1) a report based on one person who watches an object for 7 minutes, or (2) a report based on seven people who watch an object for only 1 minute?

gkelly · Feb 16, 2012

skatinginbc said:
One specialist counts the rotations, another looks for edge changes. It means only one person is making the decision on a particular technical aspect and therefore subjectivity plays a significant role. Is it more reliable than seven judges giving out assessments based on impression? I'm not sure. I need reliability studies to prove it. Let's think about this question: Which is more accurate: (1) a report based on one person who watches an object for 7 minutes, or (2) a report based on seven people who watch an object for only 1 minute?

I'm not sure how that last question applies. It would depend on what the object is, I guess, and how it changes over time.

Anyway, what might be interesting would be to have a number of skaters perform a variety of spins, both simple and complex, and compare two or more of the following methods of evaluating them:

1) 9 judges each give one score (on a scale of 1-10? 1-6? with or without decimals?) for each spin based on overall impression

2) Skaters perform a set of three specified spins -- e.g., a flying spin, a layback, a combination spin with one change of foot and all three basic positions -- connected by whatever skating moves they like; judges give one score on a scale of 1-6 with decimals for each skater's set of three spins. Or two scores for the whole set of spins: one for average difficulty and technical quality of the three spins and the transitions directly in and out of them, another score for artistic impression of the spins themselves and of the connecting moves and performance as a whole

3) 1 official is assigned to identify kinds of spins (that's a flying sitspin, that's a combination spin with change of foot, etc.), and each kind of spin has a base mark; 9 judges each give a grade of execution -5 to +5 according to specific guidelines -- the middle score (5 or 0) means the spin meets the requirements for that kind of spin with acceptable quality; lower scores reflect varying numbers or severity of mistakes or weaknesses according to specific rules; higher scores reflect up to 3 levels of better quality and/or up to 3 areas of added difficulty at the discretion of each judge -- there would be a published list of examples, but if a skater gets creative with a brand new variation, each judge can decide for him/herself whether it looks difficult enough to reward with a point for difficulty; difficulty points, positive quality points, and negative quality points can be added and subtracted up to the maximum and minimum and applied to the base value. -5 for severe or multiple errors would subtract the full base value of the spin to end up with no points

4) One group of officials (technical panel) is assigned to identify the element and as a group to determine which features, from a predefined list were attempted and whether the skater executed each feature well enough to get credit for it, yes or no; yes for 2, 3, or 4 features earns higher base marks; a second group of judges assigns grades of execution -3 to +3 based on a list of errors and a list of positive qualities

5) One group of technical judges each independently identify the elements and independently assign difficulty levels based on which features each one of them can recognize in real time; the computer takes the majority identification and averages the levels, but in certain ambiguous situations a conference review will occur afterward; a separate group of judges assigns grades of execution positive and negative as under 3), and they also take an additional deduction if the skater falls on the spin, so a spin that was bad in general and also earned a fall deduction could end up earning negative total points

Choose at least two of the methods above. Instruct the skaters and officials about the rules of each method. Hold the test competitions.
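To make method 5) concrete, here is a minimal Python sketch of how the computer might combine independent technical calls. Everything in it is an assumption for illustration: the function name, the element labels, the choice to average levels only among the judges who agreed with the majority identification, and the rule that anything short of a strict majority triggers the conference review.

```python
from collections import Counter
from statistics import mean


def aggregate_calls(calls):
    """Combine independent technical calls for one element.

    calls: list of (element_name, level) tuples, one per technical judge.
    Returns (majority_name, averaged_level, needs_review).
    """
    if not calls:
        raise ValueError("at least one call is required")
    votes = Counter(name for name, _ in calls)
    top_name, top_votes = votes.most_common(1)[0]
    # Assumption: no strict majority counts as an "ambiguous situation"
    # that is sent to a conference review.
    needs_review = top_votes * 2 <= len(calls)
    # Assumption: only judges who agreed on the identification
    # contribute to the averaged level.
    levels = [level for name, level in calls if name == top_name]
    return top_name, mean(levels), needs_review


# Hypothetical panel of five technical judges.
panel = [
    ("flying sit spin", 3),
    ("flying sit spin", 4),
    ("flying sit spin", 3),
    ("flying camel spin", 2),
    ("flying sit spin", 4),
]
name, level, review = aggregate_calls(panel)
print(name, level, review)
```

The averaged level is fractional here, which is one way the "gray area" would show up numerically under this method.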

Now, how do you measure which method is more reliable?

hurrah · Feb 16, 2012

Hey, I was just reading fsuniverse and they said coaches are technical controllers!? My conspiracy theory just got a huge boost.

So definitely make it '(2) a report based on seven people who watch an object for only 1 minute' to make it less likely for cronyism to influence judging.

Is all this effort to preserve the ability to cheat at judging worth it? Is there anything honorable about it? Will figure skating revive its popularity in NA through such a process?

skatinginbc · Feb 17, 2012

Blades of Passion said:
I don't think the PCS scoring should be split up into two separate groups like that.

It is to reduce the Halo Effect, a common measurement error associated with Multiple Trait Scoring. When the raters are asked to make multiple judgments, there is a tendency that they actually make only one. The judgment on the first scale (Skating Skills) is likely carried over to other scales (e.g., PE, IN). An evaluation consultant can mathematically estimate the size of the Halo Effect through a reliability study. By separating them into two groups, I presumed an inherent correlation within group and disparity between groups (e.g., those who have better skating skills are able to incorporate more transitions in their programs, but they cannot necessarily perform, interpret or sell the program to the audience).

Blades of Passion said:
The other problem with the proposal is that the protocols would become much more complicated. An extra line would have to be added for every element; casual fans trying to learn more about the scoring would become even more confused.

It depends on how "casual" they are. It may be easier for the TV commentators though. Instead of technical vs. presentation scores, they can now present them as difficulty vs. execution scores and avoid the 50% artistry assumption when it is in fact not.

Blades of Passion · Feb 17, 2012

skatinginbc said:
It is to reduce the Halo Effect, a common measurement error associated with Multiple Trait Scoring. When the raters are asked to make multiple judgments, there is a tendency that they actually make only one. The judgment on the first scale (Skating Skills) is likely carried over to other scales (e.g., PE, IN). An evaluation consultant can mathematically estimate the size of the Halo Effect through a reliability study. By separating them into two groups, I presumed an inherent correlation within group and disparity between groups (e.g., those who have better skating skills are able to incorporate more transitions in their programs, but they cannot necessarily perform, interpret or sell the program to the audience).

While some of that does have a solid basis, I think you're trying to cure the symptom rather than the cause. If the judges are not able to separate the component scores and understand them for what they are, then they won't be scoring the Performance/Choreography/Interpretation components correctly to begin with even if you do remove the other two components from their score sheet. They will still see the skating skills and transitions when they watch the performance and be influenced. Plus, you can have amazing "Performance" and terrible "Choreography" and/or "Interpretation" (that is the epitome of Plushenko in several cases), so putting those three components together doesn't really help much at all to make the judging better.

The only real way for the judging to get better is for the judges to become more educated and to be less afraid of giving their opinion and to be less influenced by reputation/competition momentum when it comes to scoring a performance.

KKonas · Feb 17, 2012

skatinginbc said:
Jackie Wong said in his article, "The technical specialist is the person who identifies the error as a fall, and then the technical panel votes on whether or not it should be counted as a fall. The technical panel is made up of three people, the technical controller, the technical specialist, and the assistant technical specialist." I am confused. Does that mean there will be no vote if the technical specialist does NOT call out a "fall" in the first place? If so, it's easy for a "lenient" specialist to manipulate the outcomes, isn't it? Also, how do they "vote"? Anonymously, or through discussion like "I think we should give him the benefit of doubt. What do you think?" If there is a brief talk before the "vote", group dynamics (e.g., conformity to the "leader", to a friend, or to whoever expresses his judgment first) would play a significant role.

Tech Specialist and Asst Tech Specialist have equal status. They decide beforehand who is going to watch what, especially in pairs/dance where one watches the man, the other the girl. They both watch noted problems in slow motion. If they don't agree on an issue, the Tech Controller who watches generally makes the final decision. Coaches can be Tech specialists if they are former skaters, but are not Tech Controllers. Tech Controllers are all ISU judges, a major distinction.

hurrah · Feb 17, 2012

I stand corrected on the previous statement that coaches are controllers. Nevertheless, whether they be specialist or controller, coaches have a vested interest and they should not be part of a three-member team that does have decisive influence on the outcome. Even if not all coaches exercise favoritism every time they make edge/rotation calls, it makes it look corrupt because it systematizes the ability to cheat, thus making it a very unaccountable judgement method.

I just read through ISU's explanation of the PCS categories, and I find it hard to differentiate between 'performance' and 'interpretation'. After reading it again and again, I interpreted that the 'performance' mark is about judging how well the skater uses his/her body to express the music, and the 'interpretation' mark is about judging how well the skater uses his/her facial expression to express the music's theme or the character he/she is playing. That's what I got from reading the explanation, but I think these two categories are really confusing and maybe they should just be collapsed into one? I believe myself to have average reading comprehension, and the explanation is obtuse to this average reader.

gkelly · Feb 17, 2012

hurrah said:
I stand corrected on the previous statement that coaches are controllers. Nevertheless, whether they be specialist or controller, coaches have a vested interest and they should not be part of a three-member team that does have decisive influence on the outcome. Even if not all coaches exercise favoritism every time they make edge/rotation calls, it makes it look corrupt because it systematizes the ability to cheat, thus making it a very unaccountable judgement method.

Well, technical specialists who are coaches can't serve on panels where they have skaters in the competition, and there are some other restrictions on what constitutes conflict of interest. So most of the time the panel members will not have a "vested interest" in the results of skaters they have no relationship to outside this event.

At worst you might get a TS who is a coach calling an event in which some of the skaters sometimes compete against their own skaters. So a really devious TS could try to affect the sometimes-rival's results at this competition (if it's one that matters to world rankings etc.) or to affect the rival's confidence for future events.

It could happen, but it would be an exception. Most of the time there's no connection to any of the skaters in the event, so there's no favoritism to exercise.

I just read through ISU's explanation of the PCS categories, and I find it hard to differentiate between 'performance' and 'interpretation'. After reading it again and again, I interpreted that the 'performance' mark is about judging how well the skater uses his/her body to express the music, and the 'interpretation' mark is about judging how well the skater uses his/her facial expression to express the music's theme or the character he/she is playing. That's what I got from reading the explanation, but I think these two categories are really confusing and maybe they should just be collapsed into one? I believe myself to have average reading comprehension, and the explanation is obtuse to this average reader.

Yeah, there is some overlap between those categories.

As I understand it, the Interpretation is everything about how the skater relates to the music during the performance. The Performance/Execution is more about the way the skater relates to the spectators and the surrounding space, and also about the clarity of their body shapes (posture, extension, etc.) and the confidence and cleanness with which they execute the elements.

All the P/E criteria should still apply even if there were no music played during the performance. But the IN criteria wouldn't make sense with no music.

skatinginbc · Feb 17, 2012

gkelly said:
Now, how do you measure which method is more reliable?

There are many ways to determine reliability. I'm going to just name two simple ones and explain them in brief: (1) Inter-rater reliability (the degree of agreement among scores given by different raters), and (2) Intra-rater reliability (the degree of agreement among scores given to the same skating performance by the same rater at different times). Correlation coefficients (e.g., Pearson product-moment correlation coefficient, Intra-class correlation coefficient, etc.) are calculated as the estimates for the reliability.

Some of your examples may involve validity issues. Say, two different scoring methods are designed to evaluate the same skills. Which one is better? Well, superficially they are measuring the same thing, but are they in fact doing so? We can calculate correlation coefficients to see if the ranking outcomes produced by the two methods are virtually the same. If they basically produce the same results, whichever is easiest and cheapest is the better design.
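As a rough illustration of those calculations, here is a minimal Python sketch with hand-rolled Pearson and Spearman (rank) correlations. The function names and the scores are made up for the example. Applying `pearson` to two judges' scores for the same skaters gives an inter-rater estimate, applying it to one judge's scores for the same performances on two occasions gives an intra-rater estimate, and `spearman` compares the rankings produced by two scoring methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(scores):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rankings."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for the same four skaters.
judge_a = [5.0, 6.5, 7.0, 8.25]
judge_b = [5.5, 6.0, 7.5, 8.0]
print("inter-rater (Pearson):", round(pearson(judge_a, judge_b), 3))
print("ranking agreement (Spearman):", round(spearman(judge_a, judge_b), 3))
```

The intra-class correlation mentioned above needs an analysis-of-variance breakdown and is left out of this sketch.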

Blades of Passion said:
I think you're trying to cure the symptom rather than the cause. If the judges are not able to separate the component scores and understand them for what they are, then they won't be scoring the Performance/Choreography/Interpretation components correctly to begin with even if you do remove the other two components from their score sheet.

Indeed, as long as it is Multiple Trait Scoring, there is a risk of the Halo Effect. As I said before, my proposal is to reduce it (not eliminate it. Eliminating it might not be feasible or cost-effective). To enhance inter-rater reliability, one can do two things: (1) to improve the scoring methods, and (2) to improve raters' skills through training and monitoring.

hurrah · Feb 17, 2012

gkelly said:
As I understand it, the Interpretation is everything about how the skater relates to the music during the performance. The Performance/Execution is more about the way the skater relates to the spectators and the surrounding space, and also about the clarity of their body shapes (posture, extension, etc.) and the confidence and cleanness with which they execute the elements.

All the P/E criteria should still apply even if there were no music played during the performance. But the IN criteria wouldn't make sense with no music.

Okay, in that case, why is performance/execution defined as:

Performance is the involvement of the skater/couple/teams physically, emotionally and intellectually as they translate the intent of the music and choreography. Execution is the quality of movement and precision in delivery. This includes harmony of movement in pairs, ice dancing and synchronized skating.

If what you say is what P/E's about, just write:

Performance is the involvement of the skater/couple/teams physically, emotionally and intellectually with the spectator as they skate in the rink.

So that would mean, what? Performance marks how much the audience feels drawn/attracted to the skater? That's actually a really fishy score to mark. How is anyone meant to measure that? By how much the audience claps after the performance, or how many flowers are thrown into the rink? And then there's the problem of having someone like Kiira Korpi and she doesn't need to make any effort for the audience to be drawn to her, and then you can have someone who is less attractive with a bad body shape and what are they supposed to do? Is that fair, really? This category should be chucked out.

gkelly · Feb 17, 2012

Have you read the more detailed criteria?
http://www.isu.org/vsite/vfile/page/fileurl/0,11040,4844-152086-169302-64121-0-file,00.pdf

For Performance/Execution:

Physical, emotional, and intellectual involvement
In all skating disciplines each skater must be physically committed, sincere in emotion,
and equal in comprehension of the music and in execution of all movement.

Carriage
Carriage is a trained inner strength of the body that makes possible ease of movement
from the center of the body. Alignment is the fluid change from one movement to the
next.

Style and individuality/personality
Style is the distinctive use of line and movement as inspired by the music.
Individuality/personality is a combination of personal and artistic preferences that a
skater/pair/couple brings to the concept, manner, and content of the program.

Clarity of movement
Clarity is characterized by the refined lines of the body and limbs, as well as the precise
execution of any movement.

Variety and contrast
Varied use of tempo, rhythm, force, size, level, movement shapes, angles, and, body
parts as well as the use of contrast.

Projection
The skater radiates energy resulting in an invisible connection with the audience.

Unison and “oneness” (Pair Skating and Ice Dancing)
Each skater contributes equally toward achieving all six of the performance criteria.

Balance in performance (Pair Skating and Ice Dancing)
Spatial Awareness between partners – management of the distance between partners and
management of changes of hold (Pair Skating and Ice Dancing)
The use of same techniques in edges, jumping, spinning, line, and style are
necessary concepts of visual unison; both skaters must move alike in stroke, and
movement of all limbs and head with an equal workload in speed

I see this as kind of a catch-all component for everything else that seems important to some people but isn't covered elsewhere.

Some of the criteria are kind of vague and touchy-feely -- most of the time I think they wouldn't apply, but if a skater just really grabs the judge emotionally by the way they commit to the performance, or seems to grab the audience, this would be the place to reward it.

Although it's not explicitly stated, this could also be the component in which to penalize a sloppy performance with many small or large errors, or to reward one in which all the elements are performed especially well.

hurrah · Feb 17, 2012

gkelly said:
Have you read the more detailed criteria?
http://www.isu.org/vsite/vfile/page/fileurl/0,11040,4844-152086-169302-64121-0-file,00.pdf

For Performance/Execution:

I see this as kind of a catch-all component for everything else that seems important to some people but isn't covered elsewhere.

Some of the criteria are kind of vague and touchy-feely -- most of the time I think they wouldn't apply, but if a skater just really grabs the judge emotionally by the way they commit to the performance, or seems to grab the audience, this would be the place to reward it.

Although it's not explicitly stated, this could also be the component in which to penalize a sloppy performance with many small or large errors, or to reward one in which all the elements are performed especially well.

Yeah, I read it. I was just pointing out that ISU wording is confusing, and that I, of average English comprehension ability, could hardly understand the difference between performance and interpretation, and then you came back and said there's overlap, but performance is not about relation to music whereas interpretation is. And based on your explanation, I pointed out that the wording was then actually wrong.

I mean, don't you think that the wording should be as accurate as possible so you get accurate judging? And there shouldn't be a 'vague and touchy-feely' category if you want a marking system that's accountable.

I get that 'Execution'---being able to judge whether it was well-executed or sloppy---should be there.

And of course I would love for 'carriage' to matter but I've not seen that carriage matters that much---if it did, Mao should be getting better scores in this category for her most excellent carriage---so then let's not be hypocritical and have it written there. Or maybe just chuck it in 'interpretation'?

skatinginbc · Feb 18, 2012

What can the ISU do in a short term to improve reliability of the current system without changing scoring guidelines or rules that would affect the skaters?
My suggestions:
(1) To ensure at least two independent judgments on element levels and other technical aspects (e.g., downgrade, edge call, etc.): From reading the posts, I gathered that the technical panel works together, one watching the man and the other the girl, one counting the rotations and the other looking for wrong edges, and so forth. Although there are three people, they work as a team to make a collective judgment. In other words, they do not always produce two independent voices. My suggestion is to increase the size of the technical panel to five: a technical controller and two teams of two specialists. The teams work independently of each other, and neither has any knowledge of the other team's calls. When there is a disagreement between the teams, the technical controller reviews the footage and makes the final call, siding with either Team A or Team B (I hope that the controller can also in the future have the option of giving a "gray area" call, where the base value or deduction for that executed element is the average of the points assigned by the two teams.)
(2) To give an automatic warning for scores that are likely due to errors. The computer can be designed such that the mean GOE/PCS score for each element/category is calculated as soon as all judges enter their scores for that element/category, and automatic feedback is given on the extreme scores (e.g., arbitrarily defined as more than 1.5 points higher or lower than the mean). The judges who gave out those scores will thus have a chance to review or reexamine their decisions. With that warning system in place to enhance inter-rater reliability, the number of judges can be reduced to seven. Still, the highest and lowest scores will be deleted from the final calculation.
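The warning in suggestion (2) could look roughly like the following Python sketch. The function names and the example scores are hypothetical. The 1.5-point cutoff is the arbitrary figure from the suggestion itself, and dropping exactly one highest and one lowest score is my reading of "the highest and lowest scores will be deleted".

```python
from statistics import mean

WARNING_THRESHOLD = 1.5  # the "arbitrarily defined" cutoff from the suggestion


def flag_extreme_scores(scores, threshold=WARNING_THRESHOLD):
    """Return the indices of judges whose score lies more than
    `threshold` points above or below the panel mean."""
    panel_mean = mean(scores)
    return [i for i, s in enumerate(scores) if abs(s - panel_mean) > threshold]


def trimmed_mean(scores):
    """Drop one highest and one lowest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    return mean(ordered[1:-1])


# Hypothetical component scores from a seven-judge panel.
panel_scores = [7.0, 7.25, 7.5, 7.0, 7.25, 7.5, 4.5]
for judge in flag_extreme_scores(panel_scores):
    print(f"Judge {judge + 1}: please review your score of {panel_scores[judge]}")
print("Panel result:", trimmed_mean(panel_scores))
```

Note that in this sketch the extreme score itself pulls the mean toward it, so a median-based check might be a sturdier choice; that is a design option, not part of the suggestion.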

The current system: 12 raters (3 technical panelists + 9 judges). My suggested system: 12 raters (5 technical panelists + 7 judges).
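For the five-person technical panel in suggestion (1), the controller's role can be sketched as below. This is only an illustration under my own assumptions: the function name, the "A" / "B" / "gray" labels, and the base values in the example are all made up.

```python
def resolve_call(team_a_value, team_b_value, controller_choice=None):
    """Settle the base value (or deduction) for one element.

    team_a_value, team_b_value: points assigned independently by each team.
    controller_choice: "A", "B", or "gray"; only consulted when the teams disagree.
    """
    if team_a_value == team_b_value:
        # Two independent judgments agree, so no review is needed.
        return team_a_value
    if controller_choice == "A":
        return team_a_value
    if controller_choice == "B":
        return team_b_value
    if controller_choice == "gray":
        # "Gray area" call: the average of the two teams' points.
        return (team_a_value + team_b_value) / 2
    raise ValueError("teams disagree; controller must choose 'A', 'B', or 'gray'")


# Hypothetical base values: Team A calls the higher level, Team B the lower.
print(resolve_call(3.0, 3.0))           # teams agree
print(resolve_call(3.0, 2.4, "B"))      # controller sides with Team B
print(resolve_call(3.0, 2.4, "gray"))   # controller gives a gray-area call
```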

What can the ISU do for a long-term solution?
1. To facilitate a win-win collaboration with the academic circle: Sometimes graduate students (especially international students who lack local connections) may have a hard time finding a research topic and human subjects. Some may be willing to do the research for free if the ISU makes such opportunities available. Skaters may be willing to participate in a test event if they get to meet the judges and receive extra feedback on their skills. The ISU may arrange such events right after a competition when many skaters and judges already gather in one place.
2. To explore the option of holistic scoring: Holistic evaluation, if well-designed, can be more reliable than "objective methods". For instance, although TOEFL Speaking section is rated by holistic impression, it has a reliability estimate of 0.90, higher than the objective multiple choice Reading test (0.86) and the Listening test (0.87), and has a standard error of measurement (SEM) of 1.7, lower than Reading (2.78) and Listening (2.40). Holistic grading is more flexible in rewarding creativity or elements that do not strictly follow the prescribed feature guidelines. It may reduce the drawback of the current system that every skating program looks the same. I have been wondering if it is indeed necessary for the ISU technical panel to count the executed features. It is necessary now because two independent judgments on one element are not always ensured.
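A note on reading the TOEFL figures: in classical test theory the standard error of measurement is SEM = SD × sqrt(1 − reliability), so a section's SEM depends on the spread of its scores as well as on its reliability. The short Python sketch below back-calculates the score spread implied by each pair of figures. The numbers are simply the ones quoted above, not independently checked, and `implied_sd` is a hypothetical helper name.

```python
from math import sqrt


def implied_sd(sem, reliability):
    """Standard deviation implied by SEM = SD * sqrt(1 - reliability)."""
    return sem / sqrt(1 - reliability)


# (reliability, SEM) pairs as quoted in the post above.
sections = {
    "Speaking": (0.90, 1.7),
    "Reading": (0.86, 2.78),
    "Listening": (0.87, 2.40),
}
for name, (reliability, sem) in sections.items():
    print(f"{name}: implied SD = {implied_sd(sem, reliability):.2f}")
```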

skatinginbc · Feb 19, 2012

This is what they say about artistic gymnastics Code of Points in Wikipedia:
"In 2006, the Code of Points and the entire gymnastics scoring system were completely overhauled. The change stemmed from the judging controversy at 2004 Olympics in Athens, which brought the reliability and objectivity of the scoring system into question, and arguments that execution had been sacrificed for difficulty in artistic gymnastics."
"Since its inception in major events in 2006, the Code has faced strong opposition from many prominent coaches, athletes and judges. Proponents of the new system believe it is a necessary step for the advancement of gymnastics, promoting difficult skills and more objective judging. Opponents feel that people outside the gymnastics community will not understand the scoring and will lose interest in gymnastics, and that without the emphasis on artistry, the essence of the sport will change. Many opponents of the new scoring system feel that this new scoring system, in essence, chooses the winners before the competition ever begins. Competitors no longer compete on the same level. Each contestant begins with a unique start value; therefore, contestants assigned a lower start value or difficulty rating are knocked out of the winner's circle before the competition begins. They may compete, but they cannot win. A competitor with a higher difficulty rating will begin the competition with a much higher score...There has also been concern that the new Code strongly favors extreme difficulty over form, execution and consistency. At the 2006 World Championships, for instance, Vanessa Ferrari of Italy was able to controversially win the women's all-around title in spite of a fall on the balance beam, in part by picking up extra points from performing high-difficulty skills on the floor exercise. The 2006 Report of the FIG's Athletes' Commission, drafted after a review and discussion of the year's events, noted several areas of concern, including numerous inconsistencies in judging and evaluation of skills and routines. However, the leadership of the FIG remains committed to the new Code...."

Sounds familiar? That clearly tells us that the Code of Points is intrinsically flawed in design, no matter if it is applied to Figure Skating or Artistic Gymnastics.
(1) It does not necessarily improve reliability
(2) It does not necessarily improve objectivity
(3) It kills artistry and changes the essence of the sport
(4) People complain about winning with a fall. "They may compete, but they cannot win". If you don't believe it, just read through the thread "Can Takahashi Close The Gap On Patrick Chan", and you shall figure out the consensus.
(5) It is too complicated for casual fans to understand.
(6) The leadership (the ISU and FIG) remains committed to the new Code despite all criticisms. There must be something good behind the closed doors.

Olympia · Feb 19, 2012

Golly, that's demoralizing. It sure does make gymnastics sound like figure skating, too.

ImaginaryPogue · Feb 19, 2012

skatinginbc, do you believe that it's inherent in any "add the points system" to negate/diminish quality as related to difficulty? I obviously am pro-COP, but if it's inherently flawed and impossible to improve, I'd be happy to see 6.0 (or whatever) brought back. Granted, I'd stop watching the sport, but that's not a big deal to me.

Mathman · Feb 19, 2012

I don't understand the Wikipedia article. Are they saying that in 2006 the old 10.0 point scoring system was abandoned and a CoP-type system was put in its place? Or that the CoP has been in place for a long time, but it was substantially revised in 2006 because of complaints?

"Since its inception in 2006 the code..." This seems to say that before 2006 it was the old 10.0 system. In that case it is the 10.0 system, not the CoP, that "brought the reliability and objectivity of the scoring system into question," and raised "arguments that execution had been sacrificed for difficulty in artistic gymnastics."

Wikipedia articles can be tricky. Whoever contributed this article is an opponent of the CoP in gymnastics. A different contributor would write a different article.

skatinginbc · Feb 19, 2012

Mathman said:
"Since its inception in 2006 the code..." This seems to say that before 2006 it was the old 10.0 system. In that case it is the 10.0 system, not the CoP, that "brought the reliability and objectivity of the scoring system into question," and raised "arguments that execution had been sacrificed for difficulty in artistic gymnastics."

The term "Code of Points" is confusing because the 10-scale old system in gymnastics, based on my quick internet search, is also called "Code of Points". The new judging system Code of Points is like the Figure Skating one, with the extensive guide to the difficulty value assigned to every move and combination of moves. That's why they say "overhauling its code of points".

My interpretation of the article is: The old 10.0 system (like the Figure Skating 6.0 system) was criticized for its lack of reliability and objectivity and for not enough emphasis on execution. And the results from the NEW Code of Points: Inconsistency remains and execution is further de-emphasized.

ImaginaryPogue said:
skatinginbc, do you believe that it's inherent in any "add the points system" to negate/diminish quality as related to difficulty?

I have no problem with "adding the points system". I have problems with "objectively" counting features for an integrated skill. For instance, to evaluate speaking skill, one can "objectively" count the number of grammatical or semantic errors in one's speech. But this is the problem: Is it valid to arbitrarily place one who makes 7 errors in a different level from one who makes 8 errors? It ignores the possibility that Speaker B, despite having more errors, might have compensated for that with better communication skills. Counting the rotations of a jump is acceptable because it is pretty much one-dimensional and involves minimal creativity. Counting twists and turns in a footwork section, in my opinion, is a different story.
