A Journey into Evaluation: From Retrieval Effectiveness to User Engagement


Slide 0

SPIRE 2015 – King's College London
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
Mounia Lalmas, Yahoo Labs London
[email protected]

Slide 1

This talk
§ Introduction to user engagement
§ Evaluation in information retrieval (retrieval effectiveness)
§ From retrieval effectiveness to user engagement
  (from intra-session to inter-session evaluation)
  (from small- to large-scale evaluation)

Slide 2

This talk: beyond the click, beyond relevance, towards user engagement

Slide 3

User engagement

Slide 4

What is user engagement?
"User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently" (Attfield et al, 2011)
The emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource.
§ Self-report: happy, sad, enjoyment, …
§ Physiology: gaze, body heat, mouse movement, …
§ Analytics: click, upload, read, comment, share, …

Slide 5

Why is it important to engage users?
§ In today's wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.
(O'Brien, Lalmas & Yom-Tov, 2014)

Slide 6

Online sites differ with respect to their engagement pattern (Lehmann et al, 2012)
§ Games: users spend much time per visit
§ Social media: users come frequently and stay long
§ Search: users come frequently and do not stay long
§ News: users come periodically, e.g. morning and evening
§ Niche: users come on average once a week, e.g. for a weekly post
§ Service: users visit the site when needed, e.g. to renew a subscription

Slide 7

Characteristics of user engagement (O'Brien, Lalmas & Yom-Tov, 2014)
§ Endurability (Read, MacFarlane & Casey, 2002; O'Brien, 2008)
§ Aesthetics (Jacques et al, 1995; O'Brien, 2008)
§ Motivation, interests, incentives, and benefits (Jacques et al, 1995; O'Brien & Toms, 2008)
§ Focused attention (Webster & Ho, 1997; O'Brien, 2008)
§ Novelty (Webster & Ho, 1997; O'Brien, 2008)
§ Reputation, trust and expectation (Attfield et al, 2011)
§ Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
§ Positive affect (O'Brien & Toms, 2008)

Slide 8

Measuring user engagement
§ Self-report: questionnaire, interview, think-aloud and think-after protocols
  › Attributes: subjective; short- and long-term; lab and field; small scale
§ Physiology: EEG, SCL, fMRI, eye tracking, mouse tracking
  › Attributes: objective; short-term; lab and field; small and large scale
§ Analytics: within- and across-session metrics, data science
  › Attributes: objective; short- and long-term; field; large scale

Slide 9

Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)
We focus on:
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale

Slide 10

Evaluation in information retrieval

Slide 11

How to evaluate a search engine (Sec. 8.6)
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
  › Users find what they want and return to the search engine
  › Users complete the search task, where search is a means, not an end
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)

Slide 12

Within an online session (Lehmann et al, 2013)
› July 2012; 2.5M users; 785M page views
› Categorization of the most frequently accessed sites
  • 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
  • 760 sites from 70 countries/regions
› Short sessions: average 3.01 distinct sites visited, with revisitation rate 10%
› Long sessions: average 9.62 distinct sites visited, with revisitation rate 22%

Slide 13

Measuring user happiness (Sec. 8.1)
Most common proxy: relevance of search results
[Venn diagram: retrieved versus relevant items among all items, illustrating precision and recall]
Evaluation measures:
•  precision, recall, R-precision, precision at k, mean average precision, F-measure, …
•  bpref, cumulative gains, …
§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
  › Information need: I am looking for a tennis holiday in a country with no rain
  › Query: tennis academy good weather
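As a concrete illustration of the ranked-retrieval measures listed above, here is a minimal sketch of precision, recall, and average precision; the ranking and relevance judgments are made up for the example.

```python
# `ranking` is a list of document ids in retrieved order; `relevant` is
# the set of documents relevant to the information need (both hypothetical).

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranking, relevant, 5))   # 2/5 = 0.4
print(recall_at_k(ranking, relevant, 5))      # 2/3
print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 = 1/3
```

Mean average precision (MAP) is then simply the mean of `average_precision` over a set of queries.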

Slide 14

Measuring user happiness (Sec. 8.1)
Most common proxy: relevance of search results
§ Explicit signals: test collection methodology (TREC, CLEF, …); human-labeled corpora
§ Implicit signals: user behavior in online settings (clicks, skips, …)

Slide 15

Examples of implicit signals in web search
§ Number of clicks
§ Click at given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
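To make a few of these signals concrete, here is a small sketch that computes time to first click, abandonment rate, and mean click count from a hypothetical per-query click log (each entry is a query timestamp plus the timestamps of its clicks).

```python
# Log format (assumed for illustration): list of (query_issued_ts, [click timestamps]).

def time_to_first_click(query_ts, clicks):
    """Seconds from query submission to the first click, or None if no click."""
    return clicks[0] - query_ts if clicks else None

def abandonment_rate(log):
    """Share of queries with no click at all."""
    return sum(1 for _, clicks in log if not clicks) / len(log)

def mean_clicks(log):
    """Average number of clicks per query."""
    return sum(len(clicks) for _, clicks in log) / len(log)

log = [
    (0.0,   [2.1, 15.4]),   # two clicks, first after 2.1 s
    (60.0,  []),            # abandoned query
    (120.0, [123.0]),
]
print(time_to_first_click(*log[0]))  # 2.1
print(abandonment_rate(log))         # 1/3
print(mean_clicks(log))              # 1.0
```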

Slide 16

What is a happy user in web search?
1. The user's information need is satisfied
2. The user has learned about a topic, and even about other topics
3. The system was inviting and even fun to use
USER ENGAGEMENT
§ In-the-moment engagement: users are active on a site or stay long
§ Long-term engagement: users come back frequently and over a long period

Slide 17

Interpreting the signals

Slide 18

Click-through rate (CTR)
§ New ranking algorithm
§ New design of the search result page
§ …

Slide 19

No clicks: "I just wanted the phone number … I am totally happy ☺"

Slide 20

Dwell time (Lalmas et al, 2015)
§ Dwell time used as a proxy of user experience after a click on an ad on a mobile device
§ Dwell time on non-optimized landing pages is comparable to, and even higher than, on mobile-optimized ones
§ … when mobile optimized, do users realize quickly whether they "like" the ad or not?
[Figure: publisher landing pages, non-mobile optimized versus mobile optimized]

Slide 21

Relevance in multimedia search
Multimedia search activities are often driven by entertainment needs, not by information needs (Slaney, 2011)

Slide 22

Explorative or serendipitous search (Miliaraki, Blanco & Lalmas, 2015)

Slide 23

Objectivity versus subjectivity (Eduardo Graells, 2015)
§ Two timeline conditions: top most popular tweets, versus top most popular tweets + geographically diverse
§ Being from a central or peripheral location makes a difference: peripheral users did not perceive the timeline as being diverse
§ It should never be just about the algorithm, but also about how users respond to what the algorithm returns to them → USER ENGAGEMENT

Slide 24

Let us revisit

Slide 25

USER ENGAGEMENT and Interactive Information Retrieval (Ingwersen, Human Aspects in IR, ESSIR 2011)

Slide 26

Beyond clicks and relevance, towards user engagement
§ From intra- to inter-session evaluation
  › Dwell time and absence time
  › Linking strategy
  › Mobile advertising
  → happy users come back
§ From small- to large-scale evaluation
  › Eye tracking and user engagement questionnaire
  › Mouse tracking and user engagement questionnaire
  → we need to properly identify the happy users

Slide 27

From intra- to inter-session evaluation

Slide 28

From short- to long-term engagement: from intra- to inter-session engagement
§ We monitor intra-session metric(s): how do users engage within a session?
§ We monitor inter-session metric(s): how do users engage across sessions?
§ We know what it will mean: inter-session metrics act as a proxy of future engagement

Slide 29

User engagement metrics

Slide 30

User engagement metrics
Intra-session metrics:
•  Dwell time
•  Session duration
•  Bounce rate
•  Play time (video)
•  Mouse movement
•  Click-through rate (CTR)
•  Number of pages viewed (click depth)
•  Conversion rate
•  Amount of UGC (comments)
•  …
Dwell time as a proxy of user interest, of relevance, of conversion, of post-click ad quality, …
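Three of these intra-session metrics can be sketched from an ordered list of (timestamp, url) page views per session; the sessions below are invented. Note that the dwell time of the last page is unobservable from load events alone, so it is left out here (a common practical compromise).

```python
def dwell_times(session):
    """Per-page dwell time: the gap to the next page load, in seconds."""
    return [(url, session[i + 1][0] - ts)
            for i, (ts, url) in enumerate(session[:-1])]

def session_duration(session):
    """Time between the first and last page view of the session."""
    return session[-1][0] - session[0][0]

def bounce_rate(sessions):
    """Share of sessions with a single page view."""
    return sum(1 for s in sessions if len(s) == 1) / len(sessions)

s1 = [(0, "/home"), (40, "/article"), (220, "/comments")]
s2 = [(0, "/home")]                    # a bounce
print(dwell_times(s1))        # [('/home', 40), ('/article', 180)]
print(session_duration(s1))   # 220
print(bounce_rate([s1, s2]))  # 0.5
```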

Slide 31

Dwell time § Definition The contiguous time spent on a site or web page § Similar measures Play time (for video sites) § Cons Not clear that the user was actually looking at the site while there à blur/focus (O’Brien, Lalmas & Yom-Tov, 2014) Distribution of dwell times on 50 websites

Slide 32

Dwell time (O'Brien, Lalmas & Yom-Tov, 2014)
§ Dwell time varies by site type: leisure sites tend to have longer dwell times than news, e-commerce, etc.
§ Dwell time has a relatively large variance, even for the same site (tourist, VIP, active, … users)
[Figure: dwell time on 50 websites]

Slide 33

Dwell time across sessions or absence time

Slide 34

The context – search experience

Slide 35

The context – search experience

Slide 36

Absence time and survival analysis
[Figure: survival curves over 20 hours for stories 1 to 9]
§ Users (%) who read story 2 but did not come back after 10 hours: SURVIVE
§ Users (%) who did come back: DIE
§ DIE = RETURN TO SITE → SHORT ABSENCE TIME

Slide 37

Absence time applied to search
§ Ranking functions on Yahoo Answer Japan
§ Two weeks of click data on Yahoo Answer Japan search
§ One million users
§ Six ranking functions
§ 30-minute session boundary

Slide 38

Absence time and the number of clicks on the search result page
§ Control = no click
§ Survival analysis: a high hazard rate (dying quickly) = short absence
[Figure: survival curves for 3 clicks versus 5 clicks]

Slide 39

Absence time – search experience (Dupret & Lalmas, 2013)
Search session metrics → absence time:
1. No click means a bad user experience.
2. Clicking between 3 and 5 results leads to the same user experience.
3. Clicking on more than 5 results reflects a poorer user experience; users cannot find what they are looking for.
4. Clicking lower in the ranking (2nd, 3rd) suggests a more careful choice from the user (compared to 1st).
5. Clicking at the bottom is a sign of a low-quality overall ranking.
6. Users finding their answers quickly (time to 1st click) return sooner to the search application.
7. Returning to the same search result page is a worse user experience than reformulating the query.
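The survival-analysis machinery behind absence time can be sketched with a minimal Kaplan-Meier estimator. The observations below are synthetic: each user contributes an absence time in hours, right-censored ("survived") if they had not returned by the end of the observation window; returning is the event (the slide's "dying"), so a curve that drops quickly means short absences, i.e. a better experience.

```python
def kaplan_meier(observations):
    """observations: list of (absence_hours, returned) pairs, where
    returned=False means the user was still absent at the end of the
    window (censored). Returns the (time, survival) steps of the curve."""
    survival, curve = 1.0, []
    for t in sorted({u for u, returned in observations if returned}):
        events = sum(1 for u, r in observations if r and u == t)
        at_risk = sum(1 for u, _ in observations if u >= t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

obs = [(1, True), (2, True), (2, True), (5, True), (8, False), (8, False)]
for t, s in kaplan_meier(obs):
    print(t, round(s, 3))   # steps: (1, 0.833), (2, 0.5), (5, 0.333)
```

In practice a library such as lifelines would be used instead of hand-rolling the estimator; the point here is only the shape of the computation.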

Slide 40


Slide 41

The context – linking strategy in online news (Lehmann et al, In Progress)
[Figure: p(absence 12h) for a news provider, no click versus off-site click]
§ Off-site link → absence time
§ Providing links to related off-site content has a positive long-term effect

Slide 42

The context – mobile advertising (Lalmas et al, 2015)
§ Dwell time → ad click
§ 600% ad click difference between short and long ad clicks
§ A positive post-click experience ("long" clicks) has an effect on users clicking on ads again
[Figure: bar chart, short ad clicks versus long ad clicks]

Slide 43

Beyond clicks and relevance, towards user engagement
§ From intra- to inter-session evaluation
  › Dwell time and absence time
  › Linking strategy
  › Mobile advertising
  → happy users come back

Slide 44

From small- to large-scale evaluation

Slide 45

Small-scale measurement – focused attention questionnaire (O'Brien & Toms, 2010)
5-point scale (strongly disagree to strongly agree):
1. I lost myself in this news task experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked things out around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news task experience I let myself go

Slide 46

Small-scale measurement – PANAS questionnaire (Watson, Clark & Tellegen, 1988)
10 positive items and 10 negative items:
§ "You feel this way right now, that is, at the present moment" [1 = very slightly or not at all; 2 = a little; 3 = moderately; 4 = quite a bit; 5 = extremely] [randomize items]
§ Negative: distressed, upset, guilty, scared, hostile, irritable, ashamed, nervous, jittery, afraid
§ Positive: interested, excited, strong, enthusiastic, proud, alert, inspired, determined, attentive, active
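PANAS scoring is straightforward: the positive-affect (PA) and negative-affect (NA) scores are the sums of the ten 1-5 ratings on each subscale, so each ranges from 10 to 50. A minimal sketch, using the item names from the slide (the ratings below are made up):

```python
POSITIVE = ["interested", "excited", "strong", "enthusiastic", "proud",
            "alert", "inspired", "determined", "attentive", "active"]
NEGATIVE = ["distressed", "upset", "guilty", "scared", "hostile",
            "irritable", "ashamed", "nervous", "jittery", "afraid"]

def panas_scores(ratings):
    """ratings: dict mapping item -> rating in 1..5. Returns (PA, NA)."""
    return (sum(ratings[item] for item in POSITIVE),
            sum(ratings[item] for item in NEGATIVE))

# A hypothetical respondent: high positive affect, low negative affect.
ratings = {**{item: 4 for item in POSITIVE}, **{item: 1 for item in NEGATIVE}}
pa, na = panas_scores(ratings)
print(pa, na)  # 40 10
```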

Slide 47

Small-scale measurement – gaze and self-reporting (Arapakis et al, 2014)
§ News interest: 57 users, reading tasks (114 in total)
§ Three metrics: gaze, focused attention and positive affect
  •  questionnaires (qualitative data)
  •  eye-tracking recordings (quantitative data)
§ All three metrics align: interesting content promotes all engagement metrics

Slide 48

From small- to large-scale measurement – mouse tracking
§ Navigation and interaction with a digital environment usually involve the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor to be a weak proxy of gaze (attention)
§ A low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting

Slide 49

Relevance, dwell time & cursor (Guo & Agichtein, 2012)
"Reading" a relevant long document versus "scanning" a long non-relevant document

Slide 50

"Ugly" versus "normal" interface: BBC News, Wikipedia

Slide 51

Mouse tracking and self-reporting (Warnock & Lalmas, 2015)
§ 324 users from Amazon Mechanical Turk (between-subjects design)
§ Two tasks (reading and search)
§ "Normal" versus "ugly" interface
§ Questionnaires (qualitative data)
  › focused attention, positive affect
  › interest, aesthetics
§ Mouse tracking (quantitative data)
  › movement speed, movement rate, click rate, pause length, percentage of time still

Slide 52

Mouse tracking could not tell much about:
•  focused attention and positive affect
•  user interest in the task/topic
•  aesthetics
BUT the "ugly" variant did not result in lower user aesthetics scores, although BBC > Wikipedia
BUT the comments left said otherwise …
›  Wikipedia: "The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background."; "The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience."
›  BBC News: "The website's layout and color scheme were a bitch to navigate and read."; "Comic sans is a horrible font."

Slide 53

Flawed methodology? Non-existing signal? Wrong metric? Wrong measure?
§ Hawthorne effect
§ Design
  › Usability versus engagement
  › Within- versus between-subjects
§ Mouse movement features were not sophisticated enough

Slide 54

Mouse gestures → features (Arapakis, Lalmas & Valkanas, 2014)
[Figure: cursor trajectory (x0,y0) to (x8,y8) over time, with clicks and resting-cursor periods of 500 ms, 1000 ms and 1500 ms]
§ 22 users reading two articles
§ 176,550 cursor positions
§ 2,913 mouse gestures
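The kind of cursor features used in this line of work can be illustrated with a small sketch over a hypothetical stream of (t_ms, x, y) samples: total path length, average movement speed, and resting periods (the cursor staying put for at least some threshold, e.g. 500 ms). The sample stream and thresholds below are invented for the example.

```python
import math

def path_length(samples):
    """Total Euclidean distance travelled by the cursor, in pixels."""
    return sum(math.dist(samples[i][1:], samples[i + 1][1:])
               for i in range(len(samples) - 1))

def avg_speed(samples):
    """Pixels per millisecond over the whole gesture."""
    duration = samples[-1][0] - samples[0][0]
    return path_length(samples) / duration if duration else 0.0

def resting_periods(samples, min_ms=500):
    """(start_ts, duration) of stretches where consecutive samples share
    the same position for at least min_ms."""
    rests, start = [], 0
    for i in range(1, len(samples) + 1):
        if i == len(samples) or samples[i][1:] != samples[start][1:]:
            span = samples[i - 1][0] - samples[start][0]
            if span >= min_ms:
                rests.append((samples[start][0], span))
            start = i
    return rests

samples = [(0, 0, 0), (100, 30, 40), (200, 30, 40),
           (900, 30, 40), (1000, 90, 120)]
print(path_length(samples))       # 50 + 100 = 150.0
print(avg_speed(samples))         # 0.15 px/ms
print(resting_periods(samples))   # [(100, 800)]
```

Feature vectors like these, computed per gesture, are what the clustering on the next slide operates on.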

Slide 55

Towards a taxonomy of mouse gestures for user engagement measurement
§ The top-ranked clustering configuration was spectral clustering on the original dataset, with a hyperbolic tangent kernel, for k = 38
  •  certain types of mouse gestures occur more or less often, depending on user interest in the article
  •  significant correlations exist between certain types of mouse gestures and self-report measures
  •  cursor behaviour goes beyond measuring frustration
  •  it can inform about positive and negative interaction

Slide 56

Beyond clicks and relevance, towards user engagement
§ From small- to large-scale evaluation
  › Eye tracking and user engagement questionnaire
  › Mouse tracking and user engagement questionnaire
  → we need to properly identify the happy users

Slide 57

Towards user engagement

Slide 58

Towards User Engagement
§ Happy users come back
§ We need to properly identify the happy users

Slide 59

Thank you
§ "If you cannot measure it, you cannot improve it" (William Thomson, Lord Kelvin)
§ "You cannot control what you cannot measure" (DeMarco)
§ "The way you measure is more important than what you measure" (Art Gust)

Slide 60