Towards Baselines for Shoulder Surfing on Mobile Authentication

Given the nature of mobile devices and unlock procedures, unlock authentication is a prime target for credential leaking via shoulder surfing, a form of observation attack. While the research community has investigated solutions to minimize or prevent the threat of shoulder surfing, how well the attack performs on current systems is less well studied. In this paper, we describe a large online experiment (n=1173) that works towards establishing a baseline of shoulder surfing vulnerability for current unlock authentication systems. Using controlled video recordings of a victim entering a set of 4- and 6-length PINs and Android unlock patterns on different phones from different angles, we asked participants to act as attackers, trying to determine the authentication input based on the observation. We find that 6-digit PINs are the most elusive attack surface: a single observation leads to just 10.8% successful attacks, rising to 26.5% with multiple observations. By comparison, 6-length Android patterns suffered a 64.2% attack rate with one observation and 79.9% with multiple observations. Removing feedback lines for patterns improves security, lowering the attack rate to 35.3% and 52.1% for single and multiple observations, respectively. This evidence, as well as other results related to hand position, phone size, and observation angle, suggests the best and worst case scenarios related to shoulder surfing vulnerability, which can both help inform users to improve their security choices and establish baselines for researchers.


INTRODUCTION
Personal and sensitive data is often stored on or accessed via mobile devices, making these technologies an attractive target for attackers [17]. In the physical domain, the first line of defense against a proximate attacker seeking to gain access to the device is the unlock authenticator, the method used to authenticate the device owner to the device, e.g., by entering a 4-digit PIN.
One type of attack faced when authenticating via a mobile device is shoulder surfing, a form of observation attack by which an attacker attempts to observe the authenticator of a victim while the authenticator is being entered on the device [43]. Shoulder surfing attacks are among the most cited dangers for smartphone unlocking mechanisms [28].
While many users utilize biometric authentication as a supplement to the dominant PIN and graphical (stroke-based) pattern password entry mechanisms, this does not provide universal protection from shoulder surfing. Biometrics are a promising advancement in mobile authentication, but they can be considered a reauthenticator or a secondary-authentication device, as a user is still required to have a PIN or pattern that they enter rather frequently due to environmental impacts (e.g., wet hands). Biometrics are also known to have high false negative rates [7]. Further, users with biometrics often choose weaker PINs as compared to those without [10], suggesting that classical unlock authentication remains an important attack vector going forward.
There is much related work that both proposes and studies shoulder surfing resistant authentication mechanisms [11-14, 17, 19, 20, 25, 28], but research on the susceptibility to shoulder surfing of currently used unlock authentication, namely PINs and Android graphical pattern unlock, is limited. Further, as researchers propose methods and authentication schemes that offer protections from shoulder surfing attacks, we do not have clear baselines of comparison for improvement (or lack thereof) to current schemes.
arXiv:1709.04959v2 [cs.CR] 23 Sep 2017
In this paper, we report the results of a comprehensive study of shoulder surfing based on video recordings of a victim authenticating. Our participants, upon viewing the videos, were asked to recreate the authentication sequences, simulating basic shoulder surfing attacks. While prior work has considered visual observations of Android graphical passwords, such as smudge attacks [6] and animated tracing [39], prior research only considered a single dimension. We attempt to account for multiple conditions.
• Authentication Type: we compared 4- and 6-digit PINs, and 4- and 6-length Android graphical patterns, with visible line feedback and without.
• Observation Angle: we considered 5 different observation angles based on videos recorded simultaneously during authentication.
• Repeated Viewing: we consider situations where the participant has a single view of authentication or multiple views.
• Phone Size: we consider two different touchscreen sizes that are common in today's market.
• Hand Position: we considered two different hand positions to interact with the device, single handed thumb input, and two handed index-finger input.
We constructed a comprehensive web-based survey and recruited participants locally from our institution (n = 91) and online via Amazon Mechanical Turk (n = 1173) for a mixed-factorial subject study. Participants, acting as attackers, were presented with a set of randomized conditions and asked to view a video of an authentication. They then attempted to recreate the authentication.
Analyzing the results, we find that in all settings, Android's graphical pattern unlock is the most vulnerable, especially when feedback lines are visible; a single observation successfully attacked a 6-length pattern 64.2% of the time, rising to 79.9% with multiple observations. Shorter patterns were even more vulnerable. Removing feedback lines during the pattern entry improved security: 35.3% of attacks succeeded with a single view and 52.1% with multiple views for 6-length patterns (confirming prior work [39]). PINs, however, proved much more elusive to attack than anticipated. A single observation was sufficient to attack just 10.8% of the 6-digit PINs, rising to 26.5% after two observations.
These results support what we as a community have believed to be true anecdotally, and further demonstrate that current authentication methods provide stronger security against shoulder surfing than one might expect. Further, these results suggest that baselines of shoulder surfing success can be applied to this space, to better support mobile authentication users. Future work should allow for improvements over the current worst settings and best settings for shoulder surfing.

BACKGROUND AND RELATED WORK
Mobile Authentication Unlock Choices. In order to secure access to a mobile device, users are able to use three main mechanisms to unlock the screen.
• PIN based authentication, sometimes referred to as a passcode on iPhones: where a user is asked to recall a PIN of at least four digits. Newer iPhones, however, require a 6-digit passcode [18]. (A sample PIN layout, as used in our experiments, is shown in Figure 1.)
• Pattern based authentication: where a user is asked to recall a gesture that interconnects a set of 3x3 contact points. On the Android OS, four or more points should be selected. The user must maintain contact through the authentication, may not reuse contact points, nor jump over points previously un-contacted in connecting two points. Figure 1 shows the grid layout for patterns, as well as our labeling scheme of starting with index 0 through 8. For example, the L-shaped pattern would be 03678. Additionally, pattern based authentication occurs in two flavors. The traditional setting is that a visual feedback line is displayed as the user traverses the contact points (so called, with-lines). The second version requires the user to do the same input, but the feedback or tracing lines are not displayed (so called, without-lines). In prior work, it has been suggested that the without-lines version of the Android pattern is more secure from observation attacks, like shoulder surfing [39].
• Password based authentication, sometimes called an alphanumeric passcode: where a user is asked to recall a standard text-based password (entered using a soft-keyboard) to unlock the device.

Figure 1: Pattern contact points (left), with label indexing beginning at 0, ending at 8, and a PIN layout (right) with digits 0-9, an OK button, backspace button and display screen.
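The pattern rules described in the list above (at least four points, no reused contact points, no jumping over a point that has not yet been contacted) can be checked mechanically. The following is a minimal sketch, assuming the 0 through 8 labels are laid out row by row from the top-left corner, which is consistent with the L-shaped example 03678. The function name `is_valid_pattern` is hypothetical and is not taken from the study's software.

```python
def is_valid_pattern(pattern: str, min_length: int = 4) -> bool:
    """Check a pattern string such as "03678" against the Android rules.

    Assumes contact points are labeled 0-8 in row-major order on a 3x3 grid.
    """
    if len(pattern) < min_length or not all(c in "012345678" for c in pattern):
        return False
    points = [int(c) for c in pattern]
    if len(set(points)) != len(points):
        return False  # contact points may not be reused
    visited = set()
    for prev, cur in zip(points, points[1:]):
        visited.add(prev)
        r1, c1 = divmod(prev, 3)
        r2, c2 = divmod(cur, 3)
        # A grid point lies exactly between two others only when both
        # coordinate sums are even (e.g. 0 -> 2 passes over 1).
        if (r1 + r2) % 2 == 0 and (c1 + c2) % 2 == 0:
            middle = ((r1 + r2) // 2) * 3 + (c1 + c2) // 2
            if middle not in visited:
                return False  # jumped over an un-contacted point
    return True


if __name__ == "__main__":
    for candidate in ["03678", "0268", "1024", "012"]:
        print(candidate, is_valid_pattern(candidate))
```

Under these assumptions the L-shaped pattern 03678 is accepted, while 0268 is rejected because the stroke from 0 to 2 would pass over point 1 before it has been contacted.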
The usability and security of PINs [9,40], patterns [4,36] and passwords [22,29] have been well documented by researchers. Beyond these methods, picture-based [46] and biometric based mechanisms are also used to unlock mobile devices. The latter is becoming more commonplace. Fingerprint readers (e.g. TouchID on iPhone v5 or later) and face identification (through apps such as FaceCrypt or FastAccessAnywhere) can be used to verify the identity of the user. Biometrics are often utilized as a secondary authentication method, and a user with biometric authentication enabled must also have a PIN set. While biometrics offer promise for quick authentication, threats related to spoofing and the vulnerabilities associated with residual information left on sensors by victims can pose challenges to users [35]. In this study, we focus on the two most widely used authentication mechanisms [17,21], PINs and graphical Android patterns, with and without feedback lines.
Shoulder surfing vulnerability. Numerous types of attacks exist where mobile authentication sequences can be obtained and used by third parties (e.g., simple guessing, smudge attacks, malware attacks [17]). Mobile device users are particularly susceptible to observational attacks, as these devices are used in a range of public and unfamiliar environments where threats may be present. Inputs can be observed and recreated. Furthermore, accessibility features such as magnification of the typed character or displaying the last typed character as cleartext in the password entry field may compromise security [32].
Attacks may be performed through direct observation (potentially enhanced through binoculars or low-power telescopes), or through the use of recording devices (e.g. video cameras for later playback) which can be used to covertly obtain or infer credentials [23,43]. Even if the user attempts to shield the screen from onlookers, security may be compromised through eavesdropping; listening to secure information which can later be used for purposes of recreating entry to a mobile device. Research reveals that human adversaries, even without recording devices, can be more effective at eavesdropping than expected, in particular by employing cognitive strategies and by training themselves [26].
Mechanisms to minimize occurrences of shoulder surfing. Solutions to reduce shoulder surfing include methods of obscuring entry (e.g., through the use of screen filters, such as Amzer Privacy Shield, described by [45]), limiting the ability of third parties to view authentication stimuli input from a specific angle. Drawmetric solutions also exist where input is made on the back and/or on the front of the mobile touchscreen device, obscuring the onlooker's view (e.g. the XSide system [13]). Other drawmetric approaches utilize behavioral biometrics, which can provide an additional authentication factor, to verify the user [37].
Decoy or randomization scenarios have also been proposed [38,45], in which, even after an observation, an observer is challenged to recreate the authentication because he/she cannot differentiate between true and random input. Touch sensitivity can also be effective. A prescribed level of pressure during input is difficult for an attacker to recreate [27]. Similarly, unobservable, tactile feedback can also be used to thwart a shoulder surfer [1,15,24], where the device informs the user which of a set of passwords (or nonces) to expect.
Kim et al. [23] suggest that current approaches to reducing shoulder-surfing typically also reduce the usability of the system, often requiring users to use security tokens, interact with systems that do not provide direct feedback, or take additional steps to prevent an observer from easily disambiguating the input to determine the password [30,43]. Bianchi and Oakley [8] suggest that authentication becomes a difficult, challenging task, as some systems targeting security against malicious attackers typically place high demands on users. Wiese and Roth [44] highlight the difficulties in ascertaining the efficacy of shoulder-surfing-resistant technologies due to the lack of comparative studies. The researchers have highlighted that as set-ups and assumptions vary by author, it can be difficult to determine the security and usability of solutions.
Additionally, most of these studies do not compare directly to the current state of the art in mobile authentication, namely, how well do PINs or patterns (or other current methods) perform under attack. We attempt to fill in that gap here by providing some baselines for what level of security to expect from current authentication choices.
Evaluations of shoulder surfing using video recordings. According to [44], in order to determine the resistance of an interface to shoulder-surfing, the three main methods used by researchers include: (a) participants are cast into the roles of adversaries and users, where adversaries observe authentication sessions of users; (b) an expert adversary observes the authentication sessions of all participants; or, (c) participants are cast into the role of adversaries and observe authentication sessions of an expert user. While each method has its own advantages and disadvantages, considerations should be made regarding learning, motivation and aptitude, to develop a more reliable perception of risk.
Most related to this work is when researchers present participants with sets of video recordings depicting actors attempting to authenticate entry. Recordings generally aim to simulate an over the shoulder view. Setting up the videos in this manner ensured that the attackers would not be affected by inconsistency caused by the target [13,34]. In prior work, the choice of number of observations appears to be arbitrary in nature [44]. Schaub et al. [33] aimed to determine how participants fare when attempting to recreate authentication sequences, comparing those watching video footage vs live attempts (i.e., physically viewing over a user's shoulder). Findings revealed that the success rate of video observations is lower for almost all schemes than the respective live results, with a few exceptions deviating by only 1-2 observations. Wiese and Roth [44] recommend preferring live observations to study human shoulder surfing unless good reasons speak in favor of using video.
As our study attempted to perform a large-scale, controlled study to systematically compare the two authentication methods, we opted to use video recordings of a single "expert user" being attacked by our participants. This allowed us to perform finely tuned randomizations and make comparisons between conditions. As such, as suggested by prior work [33,44], one can consider these results as lower bounds on attack success, and hence on vulnerability. Live observations from the same angle would likely increase the vulnerability to shoulder surfing.
Baselines and guidance. While researchers have extensively explored ways to address shoulder-surfing attacks, recommendations have also been proposed on ways to design and conduct these types of study. For example, Wiese and Roth [44] recommend that, rather than arbitrarily selecting a number of observations, the number of observations made by adversaries should match their assumptions about the scenario and the environment where the scheme will be deployed. Observation strategies should also be taken into account, to gain a more detailed view of feasible strategies. In terms of set-up, Sahami Shirazi et al. [31] propose recording video footage from four different angles: front, rear, left and right, in order to compensate for the loss of 3D information in 2D videos. While limited detail was provided about the relative positioning of each camera, this type of technique would be useful to better simulate shoulder surfing scenarios. Schaub et al. [33] have highlighted different ways that users hold and interact with the device. Occlusion by the user's hand and fingers may reduce visibility for shoulder surfers and enhance observation resistance.
As we will describe in the next section, we attempt to account for many of these factors and suggestions. Namely, we apply video recordings from multiple angles, allow for repeated observations and repeated entries, and we also consider different form factors and hand positions for our mobile devices.

METHODOLOGY
We designed a mixed-factorial design with both between- and within-subject factors in order to reduce the duration of the study to an acceptable length. Between subjects, we randomized participants into 12 groups based on the authentication type (3 treatments), hand position (2 treatments), and phone type (2 treatments). Within each group, participants were shown a series of videos for a set of 10 authentications. After each video, each participant attempted to recreate the authentication observed. As part of a within group analysis, the observation angles, the number of observations, and the number of attempts to recreate the authentication were randomized.
Based on this design, we intended to address the following set of hypotheses:
• H1: The type of unlock authentication (PIN, pattern with-lines, pattern without-lines) affects the shoulder surfing vulnerability.
• H2: Repeated viewings of user input increase the likelihood of a shoulder surfing vulnerability.
• H3: Multiple attempts to recreate the input increase the likelihood of a shoulder surfing vulnerability.
• H4: The angle of observations affects shoulder surfing vulnerability.
• H5: The properties of the unlock authentication, such as length and visual features, affect shoulder surfing vulnerability.
• H6: The phone size affects shoulder surfing vulnerability.
• H7: The hand position used to hold and interact with a device affects shoulder surfing vulnerability.
In the remainder of this section, we outline the settings of our experiment and the design choices made. We first discuss the settings of our video recordings that dictate the participant groups, following which we discuss the password/PINs used in the experiments, how they were selected, and the properties they exhibit. Finally, we discuss the survey mechanisms, training, and other procedures.

Video Recording Settings
Phone settings. We used two phones in our experiments: the Nexus 5 and the OnePlus One. The Nexus 5 is a mid-range size phone, with a 5" display. The OnePlus One has a larger form factor of 6" (compared to 5.4" for the Nexus 5) with a screen size of 5.5". Both phones have the same resolution of 1080x1920 pixels. These two phones are similar to a wide variety of displays and form factors available on the market today, for both Android and iPhone. In charts and tables, we refer to the phones by their coloring, red for the Nexus 5 and black for the OnePlus One. The goal of using these two phones is to understand how larger form factors, which provide more viewable space, may affect the attacker's ability to shoulder surf (H6). There are also side effects of a larger display that we did not anticipate. For example, in Figure 2, with the larger OnePlus One phone, we experienced more glare on the screen as it was a bit more unwieldy. Being larger in the hand, the OnePlus One moved more during PIN/pattern entry, particularly one-handed, which caused more opportunities for glare.
Figure 2: Phone Types and Hand Positions. Top: the Nexus 5 (red phone; 5.43"x2.72" form factor, 4.95" display, 1080x1920). Bottom: the OnePlus One (black phone; 6.02"x2.99" form factor, 5.5" display, 1080x1920). The Nexus 5 is roughly the size of an iPhone 6s and the OnePlus One is roughly the size of an iPhone 6s+. On the left (Thumb) is single handed entry, using the thumb only, and on the right (Index) is two handed entry using the index finger.
Hand positions. We investigated two different phone-grips (or hand positions) for authentication entry. Figure 2 shows the grips. The images on the top-left and bottom-left show a single handed grip being used, where the thumb is used to enter the authentication. In the images at the top-right and bottom-right, the grip is two handed, where the user holds the phone in their left hand and enters the authentication sequence using the index finger of their right hand. These are both common grip settings for mobile devices [16]. We focus exclusively on right handed entry modes to reduce the complexity of our experiment. In charts and tables, we describe these two hand positions as thumb for the single handed grip with thumb entry, and index for the two handed grip with index finger entry.
We applied these two conditions because we hypothesized (H7) that visual obstructions may impact the vulnerability to shoulder surfing. For example, using an index finger provides the least obstructed view, compared to using the thumb, but it also may increase point-of-view obfuscation where it may appear that contact is being made with the phone, when it is only an illusion due to the angle of observation.
Angles of recording. We used a camera array to simultaneously record each authentication (e.g., one phone type, one hand position type, one authentication input) from multiple angles. The camera array is shown in Figure 3. The target user, who is seated for the study, is the subject of observations from five angles in the camera array to simulate different vantage points. As outlined in Table 1, there is a near and a far angle on each of the left and right sides. We also had a top angle with a vantage point immediately overhead of the target user. In charts and tables, we shorthand these angles as: nl for near left, fl for far left, t for top, nr for near right, and fr for far right.
We hypothesized that there may exist settings of observations that both hinder and enhance the attacker's ability to shoulder surf (H4). For example, observations from one side over the other (e.g., left v. right) may provide more or less obstructed views, aiding or hindering shoulder surfing.
Editing Videos. During video recordings, we attempted to make each authentication occur over a consistent length of time with a consistent hand motion. We further attempted to remove any distractions from the observation area so that participants can focus directly on the task of shoulder surfing. Each video recording is about 3-5 seconds in length, which creates a tracking challenge for the participant, who needs to quickly determine where to look in a video (occurring from different angles each time) to do the observation. To alleviate this burden, we edited the videos by placing a "focus zone" in the video. See Figure 4 for a visual of this editing. Except for the authentication area, the remainder of the screen is overlaid with a transparent gray so that the participant can quickly determine where to focus their visual attention for the observation task.

Authentication Settings
As previously mentioned, we aim to analyze two different authentication settings, PINs and Android graphical pattern unlock. Within the Android pattern settings, we also consider settings where the tracing lines are either displayed or not displayed. Recent work has suggested that tracing lines should not be displayed for improved security [39]. For each authentication setting, we have chosen a set of 10 representative PINs and patterns that have spatial shifting properties and visual complexity properties, such as crosses.
In the remainder of this section, we outline how that selection was performed and justify the properties used during selection. Additionally, we describe the application used for performing input and how it was designed to fairly compare the two authentication types.
Pattern Selections. The patterns used in our experiment are shown in Table 2 (graphical representations are presented in the Appendix). These patterns were culled from a set of self-reported patterns collected through an online study [4], and provided to us for analysis and use. From these patterns, we identified five 4-length patterns and five 6-length patterns that exhibited a broad set of representative features.
To determine which features to consider, we hypothesized that there may be locations in the grid space that increase or decrease the effectiveness of the attack (H5), as well as complexity features of patterns [2,3]. We were guided by related work [5,41] in choosing the features, for both spatial aspects and complexity properties.
PIN selections. In order to select PINs, we followed related work in analyzing digit sequences in password datasets [9]. Using the RockYou dataset¹, we extracted 4- and 6-length digit sequences that exhibited similar properties to that of the pattern dataset. The idea is that these digit sequences are likely to be reused as PINs if they appear in passwords.
Matching the PINs to the exact features in the patterns is not perfect, as not all digit sequences found in patterns exist within the RockYou dataset, and further, we wish to include all 10 digits (patterns only use 9 contact points). PINs also have a feature that patterns cannot have, repeated digits, so we wish to include PINs with this property, either a single digit or multiple repeated digits. The final set of PINs selected is available in Table 2, and a visual is provided in Appendix B.2.
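The extraction of 4- and 6-length digit sequences from a password list can be sketched as follows. This is only an illustration under assumptions: the text does not say whether sequences were taken as maximal digit runs or as arbitrary substrings, so the sketch counts maximal runs of exactly the requested length, and the name `digit_sequences` and the sample passwords are hypothetical.

```python
import re
from collections import Counter


def digit_sequences(passwords, length):
    """Count digit runs of exactly `length` digits across the passwords.

    Assumption: only maximal runs are counted, so "12345" contributes
    nothing for length 4.
    """
    # Lookarounds ensure the run is not part of a longer digit run.
    run = re.compile(r"(?<!\d)\d{%d}(?!\d)" % length)
    counts = Counter()
    for password in passwords:
        counts.update(run.findall(password))
    return counts


if __name__ == "__main__":
    # Hypothetical stand-in for a leaked password list.
    sample = ["love1234", "123456", "abc2580xyz", "pass12345", "1234"]
    print(digit_sequences(sample, 4).most_common())
    print(digit_sequences(sample, 6).most_common())
```

Candidate PINs would then still need to be screened by hand for the spatial and repetition properties discussed above; the sketch only covers the counting step.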
Authentication Applications. Another important factor to consider is the applications used for entering the authentication. Critically, the size of each application should be the same and have similar visual properties, so as not to advantage one over the other for shoulder surfing. To this end, we designed two JavaScript applications using HTML5 that ran in the Android Chrome browser, set up as a home screen link to simulate a standalone application. Each application mimicked the input used on the device, following the same rules. During the survey, the same applications were embedded as JavaScript in the participants' browsers so that they could recreate the authentication observed.
For patterns with-line feedback, after the target user completed the pattern entry, there would be a brief, 200 ms pause before the screen would go blank (black). This is to simulate the unlock process on the phone. A similar action occurs for patterns without-line feedback; however, no tracing lines or circled contact points would be seen.
¹ Originating from a defunct music sharing web site, the RockYou dataset was leaked in 2009 and contains over 32 million passwords commonly used by researchers [42].
For PINs, the input text area would show the number that was pressed, but would fade to an asterisk after one second or after the next number was pressed, similar to how unlock authentication works on smartphones. Only after pressing "ok" would the screen go blank, simulating an unlock.
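The PIN display behavior described above (the most recent digit is visible until one second passes or another digit is pressed, and every earlier digit is an asterisk) can be expressed as a small pure function. This is a sketch in Python for illustration only; the study's applications were written in JavaScript, and the function name and parameters here are hypothetical.

```python
def pin_display(digits: str, seconds_since_last_press: float,
                fade_after: float = 1.0) -> str:
    """Return the text shown in the PIN entry field.

    Earlier digits are always masked; the most recent digit is shown in
    the clear until `fade_after` seconds have elapsed.
    """
    if not digits:
        return ""
    masked = "*" * (len(digits) - 1)
    last = digits[-1] if seconds_since_last_press < fade_after else "*"
    return masked + last


if __name__ == "__main__":
    print(pin_display("12", 0.2))
    print(pin_display("12", 1.5))
```

For an observer, this means each digit is readable on screen for at most about one second, in addition to whatever can be inferred from the finger position.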

Survey Protocol
We designed a protocol around the video recordings by which participants would be assigned a randomized group, receive training relevant to that group, and then attempt to shoulder surf 10 authentications based on observing videos under different settings. The survey was designed as a web application using a combination of PHP, JavaScript, and a MySQL backend. The survey was posted on Amazon Mechanical Turk and participants were also recruited locally at our institution to ensure consistency. The survey protocol proceeded as follows:
(1) Informed Consent
(2) Demographic and Background Information
(3) Training
(4) Observations and Recreation
(5) Attention Check and Submission
In the remainder of this section we outline each of these survey segments in detail, as well as the randomization and recruitment process.
Informed Consent and Preliminary Instructions. This survey was approved by our institutional oversight board (IRB), and so we require participants to provide informed consent. For online participants, this was done digitally, and for in-person participants, it was done in a traditional manner, following a script. The informed consent also provided participants with an overview of the experiment, its goals, and initial instructions. For example, it informed participants that they were participating in a research project about shoulder surfing, as well as directions about the procedures: "The survey will request that you maximize the browser window on your screen. You are not permitted to record the survey or any of its content. The use of pen and paper to write anything down is also strictly prohibited. The survey will request that you watch several videos of a user authenticate into a mobile device. You are to watch the video and attempt to recreate the PIN or pattern you viewed being entered."
Demographic and Background Information. Following acknowledgment of the informed consent, we ask a series of demographic questions. Additionally, we recorded the screen size of the browser, in pixels, to test if participants were following directions as well as to get a sense of the different viewing scenarios.
Between Treatment Randomization and Training. At this point in the survey, we randomize the treatments, as the remainder of the survey depends on that randomization. We initially randomize into 12 between-subject treatment groups:
• Authentication Type: PIN, patterns with feedback lines, or patterns without feedback lines
• Hand Position: index or thumb
• Phone Type: either the red Nexus 5 or the black OnePlus One
Based on the selection, we prepared three training videos that explained the procedure further, specifically for each authentication type, and then 12 sample test videos that participants can use to practice shoulder surfing. The test video used the same conditions as the selected treatment, but with a sample PIN (1234) or pattern (0123). The training video shows the participant how the observation and recall would proceed (a screen-shot of the video is in Figure 5), and once the video completes, test runs are performed using the sample PIN or pattern. Participants are allowed to repeat this training video and test runs as many times as needed before continuing to the main portion of the survey.
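The 12 between-subject groups are the full cross of the three factors listed above (3 x 2 x 2). A minimal sketch of enumerating them and drawing one for a participant is given below. The short labels follow the shorthand used elsewhere in the paper (PAT and NPAT for patterns with and without feedback lines, red and black for the phones). The uniform random draw is an assumption, since the text does not state whether assignment was uniform or balanced across groups.

```python
import itertools
import random

AUTH_TYPES = ["PIN", "PAT", "NPAT"]   # PIN, pattern with lines, pattern without lines
HAND_POSITIONS = ["index", "thumb"]
PHONES = ["red", "black"]             # red Nexus 5, black OnePlus One

# Full factorial cross of the three between-subject factors.
GROUPS = list(itertools.product(AUTH_TYPES, HAND_POSITIONS, PHONES))


def assign_group(rng: random.Random) -> tuple:
    """Draw one treatment group (assumed uniform; not confirmed by the text)."""
    return rng.choice(GROUPS)


if __name__ == "__main__":
    print(len(GROUPS))
    print(assign_group(random.Random()))
```

Each group then maps to one training video per authentication type and one of the 12 sample test videos.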
Within Treatment Randomization for Observation and Recall. At this point, a participant has been assigned an authentication type, phone type, and hand position. There is now a large set of videos from multiple angles for each of the authentications, but it is not feasible (nor desirable) to display every video to each participant. Instead, we proceed with a within-group randomization to display a subset of those videos under different settings that will support testing hypotheses H2, H3, and H4. The first stage of randomization is to randomize the order of the authentications that will be displayed. That is, each participant will observe all 10 of the authentications in their selected authentication type, either 10 PINs or 10 patterns, but the order of those must be randomized to handle training effects, where the participants become better at the task as time goes on. Once the order is randomized, for each authentication, we then randomize and counterbalance a set of conditions regarding how many views and attempts a participant gets to make, as outlined in Table 3. The "views" refer to how many times a participant gets to view an observation video. For conditions A-C, a random angle is selected, and the participant either gets a single view of that authentication (A,B), or two views from the same angle (C). For conditions D and E, participants get a random first angle selection, and then are assigned a second angle on the opposite side (e.g., first angle is a left side, second angle is a right side). If the top angle was selected, then a random second angle is used. The second part of each condition is the number of attempts. After viewing the video, the participant can make either one attempt to recreate the authentication or two attempts.
Prior to each video observation, we informed the participant if they were going to view one or two videos and if they would have one or two a empts.
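The within-treatment procedure above can be sketched as follows. Several details are assumptions and are labeled as such in the code: the number of attempts per condition is defined in Table 3 and is deliberately left out, counterbalancing is approximated by using each of the five conditions equally often across the 10 authentications, and when the top angle is drawn first the second angle is taken to be any of the other four. The names `pick_angles` and `plan_session` are hypothetical.

```python
import random

# Angle shorthand from the paper, mapped to the side it views from.
ANGLES = {"nl": "left", "fl": "left", "t": "top", "nr": "right", "fr": "right"}

# View mode per condition as described in the text; attempts per
# condition come from Table 3 and are not modeled here.
VIEW_MODE = {"A": "single", "B": "single", "C": "same",
             "D": "opposite", "E": "opposite"}


def pick_angles(condition, rng):
    """Choose the observation angle(s) for one authentication."""
    first = rng.choice(list(ANGLES))
    mode = VIEW_MODE[condition]
    if mode == "single":
        return [first]
    if mode == "same":
        return [first, first]
    side = ANGLES[first]
    if side == "top":
        # Assumption: any angle other than top may follow a top view.
        candidates = [a for a in ANGLES if a != first]
    else:
        other = "right" if side == "left" else "left"
        candidates = [a for a in ANGLES if ANGLES[a] == other]
    return [first, rng.choice(candidates)]


def plan_session(authentications, rng):
    """Shuffle the order, then pair each authentication with a condition.

    Assumption: conditions are balanced by repeating each one equally
    often, which requires the list length to be a multiple of five.
    """
    order = list(authentications)
    rng.shuffle(order)
    repeats = len(order) // len(VIEW_MODE)
    conditions = [c for c in VIEW_MODE for _ in range(repeats)]
    rng.shuffle(conditions)
    return [(auth, cond, pick_angles(cond, rng))
            for auth, cond in zip(order, conditions)]


if __name__ == "__main__":
    for row in plan_session([f"auth{i}" for i in range(10)], random.Random()):
        print(row)
```

Shuffling the presentation order separately from the conditions keeps training effects from being tied to any particular authentication or viewing condition.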
Submission and Attention Tests. Following the survey, we ask participants to report if they used additional aids, such as pen and paper, in helping them complete the procedure. This acts as both an attention test and a guide for including or excluding results. It also allows us to exclude participants who failed to follow directions. We did not have anyone report that they "cheated."

Recruitment
We recruited locally at our institution, and online via Amazon Mechanical Turk. The goal of using both recruitment methods is that for the institutionally recruited participants, we can control the settings, and so we wished to compare these results to those collected online for consistency (see Figure 6). Inconsistent results would suggest that online participants were not taking the survey faithfully. We observed consistent results when comparing similar demographic groups with similar screen resolutions, as described later, suggesting that participants online took the survey in the intended ways. There was, however, some degradation of performance, which may be accounted for by an observation bias or the Hawthorne effect; local participants, being observed, were more likely to try to perform the task well to appease their observers.
Table 5: Single- vs. multi-view for authentication types broken up based on online and in-person participants. NPAT is pattern without feedback lines and PAT is with feedback lines. Comparing single- vs. multi-view, in all categories, was statistically significant, as well as in-person vs. online.
In total, we recruited 91 participants locally at our institution, and 1173 online participants. The demographic information is available in Table 4. The material used in recruitment mimicked that of the informed consent. The text used in posting the task to Amazon Mechanical Turk is provided in Appendix A.

Realism and Limitations
We acknowledge that our experimental methodology has a number of limitations. Foremost, we had to reduce the set of authentication tokens to a reasonable size, namely 10, so that we could maintain a reasonable survey length with a reasonable recruitment size. We attempted to mitigate this effect by choosing real authentications, as collected in other datasets, that would be representative of authentication choice broadly. We further did not include text-based passwords, which can form an unlock authentication, as we were unable to develop a protocol to fairly compare to the other authenticators.
We were additionally limited in terms of the observation settings. Our online participants may have used screens that were bigger or smaller than we anticipated. We attempt to manage this limitation by recording the screen size, and, as we will show in this paper, there was an impact on performance with respect to screen size. However, general trend lines remain the same when we compare the online data to that collected in-person.

Ethical Considerations
This protocol was reviewed and approved by our institutional review board to ensure that participants were treated fairly, such as providing informed consent and an option to opt out. The survey itself does not elicit ethical challenges, as participants are not performing actions that increase the risk to others or themselves in regard to shoulder surfing. It could be argued that these participants may be more aware of the risks associated with these attacks after having participated. The identity of the target victim was protected from participants via obfuscation. Finally, the analysis does not include identifiable information about participants.

RESULTS
In this section, we describe the results of the survey by addressing each of the hypotheses outlined earlier. We also provide other insights as available, particularly related to the realism of the experiment. As we move through the results, it is important to note that in some conditions (C and E) participants had multiple attempts to recreate the observed authentication, which we study in more detail later. Unless otherwise noted, we consider an attack successful if the participant accurately recalled and entered the authentication sequence within either of the attempts. For statistical testing, our data is categorical and binary, as in a participant either correctly recalled and entered an authentication sequence or did not. As such, in two-way comparisons of attack rates, we applied Fisher's Exact Test (or G-test) to test significance, and a χ² test for multi-factor analysis. Additionally, we perform an L1-penalty logistic regression analysis to determine the impact (or lack thereof) of all settings of the experiment. A significance level of p < 0.05 is used. Finally, unless otherwise stated, a percentage displayed in a table refers to the rate at which an authentication was successfully recreated in that setting, a so-called success or attack rate.
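As an illustration of the two-way comparisons mentioned above, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table of successes and failures. The function name and the counts in the usage example are hypothetical and are not taken from the study data; the two-sided p-value follows the common convention of summing every table that is no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two settings being compared; columns are
    (successful attacks, failed attacks).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x successes in row 1 with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table at most as likely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Hypothetical counts: 20/100 successes with one view, 40/100 with two views.
    p_value = fisher_exact_two_sided(20, 80, 40, 60)
    print("significant at 0.05:", p_value < 0.05)
```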
Realism of online results. An important question to consider is if online results are consistent with those collected in-person. As evident in Table 5, there is a significant performance improvement for those in-person participants (p < 0.005, using χ²). Investigating this phenomenon further, we broke down the participants based on the width of the screen resolution used while taking the survey, which is a good approximation for the size of their viewing area. These results are presented in Figure 6, and one can clearly see that as the resolution width increases, so does the performance. As we controlled our in-person computing setup, we know precisely the screen resolution of 990x1840, and further, our in-person participants (being undergraduates) are between the ages of 18-24. When isolating this demographic group, we find no statistical differences between the PIN and PAT results, with a remaining difference for patterns without traceback lines (NPAT). This difference is likely the result of an observation bias, by which having the researchers present led the participants to want to perform the task "better." As such, we find that these results suggest that online participants likely took the survey in the intended manner, and variations in screen size (and other factors) probably realistically mimic the realities of shoulder surfing in the wild. The remaining results focus solely on the online dataset.
Table 7: Impact of observation angle on shoulder surfing. Single-view treatments and only multi-view treatments of the same angle are considered (see Table 5 for single- vs. multi-view). Using χ² testing, * indicates p < 0.05, ** indicates p < 0.005.
H1: authentication type. We applied both Fisher's exact test and the χ² test to the data in Table 5 and found all comparisons between authentication types to be significant. Focusing on the online results, we find that the authentication type plays a significant role. PINs proved the most elusive in all settings, with a combined attack rate of 32.25%. Patterns with traceback lines were the worst performing, with a 78.27% attack rate across all settings. Removing traceback lines improved results to 58.28%, confirming prior work on this topic [39]. As such, we accept the hypothesis that the authentication type impacts shoulder surfing vulnerability.
H2: repeated observations. Using the results in Table 5, we find that there are significant differences between the single-view and multi-view settings. Looking at both online and in-person results, participants are about 1.3x-1.4x more likely to correctly attack an authentication if allowed multiple views of the authentication. Later, as we compare all the features, we find that multiple views, in particular, play an outsized role in the vulnerability of authentication to shoulder surfing. As such, we accept the hypothesis that multiple observations impact shoulder surfing vulnerability.
H3: multiple input attempts. Recall that we applied a within-group randomization by which some participants on some authentications were provided two attempts to input the authentication. The procedure of the survey informed them of this fact, so participants were aware, prior to viewing the video, that they would have multiple recreation attempts. Table 6 shows these results in the right column.
Surprisingly, multiple attempts decrease performance in all cases. We believe this is because participants, knowing they would have multiple attempts, attempted to "game" the process in a way that actually led them to get the pattern wrong in both attempts. For example, they may have paid less attention. We accept the hypothesis that multiple input attempts affect shoulder surfing, but, unexpectedly, they decrease performance. From this result, researchers conducting similar experiments should consider either not informing participants how many recreation attempts they will have, or forcing participants into a regime of single attempts, requiring more attention during that single attempt.
H4: observation angle. Table 7 presents the results comparing performance for the different observation angles. As we wished to isolate the angle, we only consider treatments where a single observation angle was used. We used a χ² test to determine significant factors in these scenarios, indicated with *'s in the table. In nearly all cases, within each authentication type, we found that there are significant impacts based on the angle of observation. When performing comparisons in total, we find that the far-right, far-left, and near-right angles showed the most significant impact. The far-left angle, in particular, was the most challenging angle, and we believe that this angle provides some obfuscation of when screen touching occurred, making it harder for participants to cleanly determine the location of touch events. As such, we accept the hypothesis that the observation angle affects shoulder surfing.
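To make the χ² comparison across angles concrete, the sketch below computes a Pearson χ² statistic for a table of success and failure counts per angle and compares it to the standard 0.05 critical value for the matching degrees of freedom. The helper names and all counts are hypothetical placeholders, not values from Table 7, and the paper does not state which software was used for its tests.

```python
# Standard chi-square critical values at the 0.05 level, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def exceeds_critical_value(statistic, dof):
    """True when the statistic is significant at the 0.05 level."""
    return statistic > CRITICAL_05[dof]


if __name__ == "__main__":
    # Hypothetical (successes, failures) per observation angle.
    counts_by_angle = {
        "far-left": (30, 70),
        "near-left": (50, 50),
        "top": (50, 50),
        "near-right": (45, 55),
        "far-right": (40, 60),
    }
    stat, dof = chi_square_statistic(list(counts_by_angle.values()))
    print("dof:", dof, "significant:", exceeds_critical_value(stat, dof))
```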
H5: properties of authentication. We first consider the length of the authentication, the results of which are displayed in Table 5. The length has a large impact. In most cases, greater length decreased the rate of shoulder surfing by nearly 50%. While length is far from a perfect approximation for security, it is clear that longer authentication will improve security against observation attacks. We further break down the vulnerability of the individual authentications in Table 8. Many of the authentications vary from an expected uniform attack rate, as observed by using a χ² test within each authentication length. However, there does not appear to be a direct pattern related to the individual spatial properties of the authentication; additional analysis with more authentication types would be needed to draw strong conclusions regarding these features. As such, we partially accept the hypothesis that the properties of authentication impact shoulder surfing: while features such as authentication length play a large role, the impact of other features is inconclusive.
Table 8: Individual authentication attack rate. Significance tested using χ² within authentication of the same length, * indicating p < 0.05 and ** indicating p < 0.005, or much less.
H6: phone size. Table 6 shows the result of comparing the two phones in the study. Recall that the Red phone refers to the 5" display Nexus 5, and the Black phone refers to the 5.5" display OnePlus One. Across all conditions, we find that there is a significant difference in shoulder surfing between the two phones. In most cases, the larger Black phone provides less security, except for patterns (PAT), where the smaller Red phone is more secure. After reviewing the videos, we noticed that the larger Black phone experiences more glare during this recording, which could account for the difference. Overall, it appears that the larger phones provide less security for shoulder surfing, and we accept the hypothesis.
H7: hand position. Recall that we examined two different hand positions. One hand position (or grip) had the victim use a single hand, entering the authentication with his thumb. The second hand position was two-handed, holding the phone in the left hand and entering the patterns with the right index finger. Table 6 shows the results of comparing these two conditions, thumb vs. index. The results for comparing PINs showed no significant difference; however, there were significant, but small, differences between pattern entry for the different hand positions, as well as a small significant difference overall. While there is a difference, the size of the effect is small, so we reject the hypothesis that hand position impacts shoulder surfing. These results suggest that researchers can allow for any normal hand position without greatly impacting the results; however, using an index finger provides a more direct view (as opposed to the one-handed thumb blocking portions of the screen) and likely improves results, nominally.
Comparison across features. Finally, we wish to understand how the combination of the features impacts the results, asking the question: is there a set of ideal or non-ideal conditions for shoulder surfing that can form a set of baselines? To accomplish this, we performed a logistic regression across all the features using L1 penalties, such that features that have a small (or no) effect can have a coefficient of zero. The results of an average of 100 runs of the regression (there were many different minimums) are presented in Table 9. The regression was set up using a feature set of binary values, with a one indicating the presence of the feature and zero otherwise. The label on the feature was also binary, a zero indicating that the shoulder surfing attack failed and a one indicating success. We trained over each trial of the survey, and the resulting model was able to explain 68.7% of the data and was significant.
Table 9: L1-penalty logistic regression using all features, the average of 100 runs of the regression. 68.7% of the data is explained by the regression. The * indicate top ranked coefficients. The model is significant.
We can further analyze the coefficients of the features, which indicate how much weight they provide to the prediction and also if they increase or decrease the likelihood of shoulder surfing. Negative values imply greater security to shoulder surfing, while positive values indicate more vulnerability to shoulder surfing. As we are using an L1 penalty, some coefficients can reduce to zero.
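The way an L1 penalty drives uninformative coefficients to exactly zero can be seen in a small, self-contained sketch. The code below fits a logistic regression by proximal gradient descent with soft-thresholding. This is an assumed, simplified solver chosen for illustration; the paper does not name the solver or library it used, and the toy trials, feature names, and penalty strength are hypothetical.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_l1_logistic(rows, labels, lam=0.05, lr=0.5, steps=2000):
    """L1-penalized logistic regression; the intercept is not penalized."""
    n, k = len(rows), len(rows[0])
    bias, weights = 0.0, [0.0] * k
    for _ in range(steps):
        grad_b, grad_w = 0.0, [0.0] * k
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err / n
            for j, xi in enumerate(x):
                grad_w[j] += err * xi / n
        bias -= lr * grad_b
        # Gradient step followed by the L1 proximal (soft-threshold) step.
        weights = [
            soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grad_w)
        ]
    return bias, weights


def toy_trials():
    """Hypothetical trials: features are [is_pattern, is_thumb], label 1 = success."""
    rows, labels = [], []
    for is_pattern in (0, 1):
        for is_thumb in (0, 1):
            successes = 8 if is_pattern else 2  # outcome depends on type only
            for i in range(10):
                rows.append([is_pattern, is_thumb])
                labels.append(1 if i < successes else 0)
    return rows, labels


if __name__ == "__main__":
    bias, (w_pattern, w_thumb) = fit_l1_logistic(*toy_trials())
    print("pattern:", w_pattern, "thumb:", w_thumb)
```

In this toy data the hand-position feature carries no information about the outcome, so its coefficient is shrunk to zero while the pattern coefficient stays positive, mirroring the sign convention described above.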
Most surprisingly, the coefficient for NPAT (patterns without tracing lines) is 0. This makes sense if you consider the fact that being a PIN so greatly reduces the likelihood of shoulder surfing, while being a pattern greatly increases it. The fact that it is a pattern without lines is not predictive in comparison to those other two facts. Further, the highest coefficient is that of shoulder surfing a pattern, followed by PINs (in the negative direction). This further supports accepting hypothesis H1.
Among the other coefficients, the length factor and having multiple views of the authentication play a large role in shoulder surfing attack rates. The far-left angle proved to be the most challenging for shoulder surfing, as identified earlier, while near-left and top were the most beneficial for shoulder surfing.
Based on these results, we can now identify the best case scenario for an attacker performing shoulder surfing: attacking a pattern with tracing lines that is of length four when provided with multiple views. Similarly, the worst case scenario is attacking a PIN of length 6 from the far left when provided with just one view. These two scenarios can provide a baseline to compare new systems that offer protections against shoulder surfing, as well as help inform users of stronger authentication choices.

CONCLUSION
We presented the results of a large scale, online study of shoulder surfing for the most common unlock authentications: PINs, patterns with tracing lines, and patterns without tracing lines. We find that PINs are the most secure against shoulder surfing attacks, and while both types of pattern input are poor, patterns without lines provide greater security. The length of the input also has an impact; longer authentication is more secure against shoulder surfing. Additionally, if the attacker has multiple views of the authentication, the attacker's performance is greatly improved.
Overall, the goal of this research is to work towards establishing baselines for how current authentication performs against shoulder surfing, as well as provide insight into settings of current authentication that can protect users from shoulder surfing. Based on our analysis, researchers should consider comparing the performance of new systems to the most secure setting, namely using at least 6-digit PINs with just a single view, as well as to the least secure setting of using a 4-length pattern with visible lines with multiple views. Additionally, these results suggest, for users, that 6-digit (or longer) PINs provide the best security from shoulder surfing.

ACKNOWLEDGMENTS
We thank Courtney Tse for her assistance conducting user studies. This research is funded by the National Security Agency and the Office of Naval Research (N00014-15-1-2776).