What a long, strange trip it’s been

Slide 0

What a long, strange trip it’s been R.V.Guha Google schema.org

Slide 1

Outline of talk
- The context
  - How did we end up where we are
- Schema.org
  - What it is, status of adoption
  - Schema.org principles, how does it work
- Looking ahead
  - Next Generation Applications

Slide 2

About 18 years ago, …
- People started thinking about structured data on the web
- A few people from Netscape, Microsoft and W3C got together @MIT
  - Trying to make sense of a flurry of activity/proposals
  - XML, MCF, CDF, Sitemaps, …
- There were a number of problems
  - PICS, Meta data, sitemaps, …
- But one unifying idea

Slide 3

Context: The Web for humans HTML

Slide 4

Goal: Web for Machines & Humans

Slide 5

What does that mean?
Notable points
- Graph Data Model
- Common Vocabulary
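
One way to picture the "graph data model" and "common vocabulary" points is as a set of labeled edges between nodes. The following is a minimal sketch in Python, not anything shown in the talk: the node identifiers (`movie:1`, `person:1`), the sample data, and the helper names are assumptions made for illustration, and the property names merely follow schema.org-style naming.

```python
from collections import defaultdict

# Illustrative data only: node ids are hypothetical, property names follow
# schema.org-style naming (type, name, actor).
TRIPLES = [
    ("movie:1", "type", "Movie"),
    ("movie:1", "name", "Thor"),
    ("movie:1", "actor", "person:1"),
    ("person:1", "type", "Person"),
    ("person:1", "name", "Chris Hemsworth"),
]


def build_graph(triples):
    """Index (subject, property, value) edges as subject -> property -> [values]."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        graph[subject][prop].append(value)
    return graph


def names_via(graph, node, prop):
    """Follow `prop` edges out of `node` and collect the names of the targets."""
    names = []
    for target in graph[node][prop]:
        if target in graph:  # the value is itself a node, not a literal
            names.extend(graph[target]["name"])
    return names


graph = build_graph(TRIPLES)
print(names_via(graph, "movie:1", "actor"))
```

Because both pages and consumers would use the same property names, a consumer can walk edges such as `actor` without knowing which site produced them.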

Slide 6

How do we get there?
- How does the author give us the graph
  - Data Model: Graph vs tree vs …
  - Syntax
  - Vocabulary
  - Identifiers for objects
- Why should the author give us the graph?

Slide 7

Going depth first
- Many heated battles
  - Lots of proposals, standards, companies, …
- Data model
  - Trees vs DLGs vs Vertical specific vs who needs one?
- Syntax
  - XML vs RDF vs JSON vs …
- Model theory, anyone?
  - We need one vs who cares vs what’s that?

Slide 8

Timeline of ‘standards’
- ’96: Meta Content Framework (MCF) (Apple)
- ’97: MCF using XML (Netscape) → RDF, CDF
- ’99 -- : RDF, RDFS
- ’01 -- : DAML, OWL, OWL EL, OWL QL, OWL RL
- ’03: Microformats
- And many many many more …
  - SPARQL, Turtle, N3, GRDDL, R2RML, FOAF, SIOC, SKOS, …
- Lots of bells & whistles: model theory, inference, type systems, …

Slide 9

But something was missing …
- Fewer than 1000 sites were using these standards
- Something was clearly missing and it wasn’t more language features
- We had forgotten the ‘Why’ part of the problem
- The RSS story

Slide 10

’07 - : Rise of the consumers
- Yahoo! Search Monkey, Google Rich Snippets, Facebook Open Graph
- Offer webmasters a simple value proposition
- Search engines to webmasters:
  - You give us data … we make your results nicer
- Usage begins to take off
  - 1000x increase in marked-up pages in 3 years

Slide 11

Yahoo Search Monkey
- Give websites control over snippet presentation
- Moderate adoption
  - Targeted at high end developers
  - Too many choices

Slide 12

Google Rich Snippets: Reviews

Slide 13

Google Rich Snippets: Events

Slide 14

Google Rich Snippets
- Multi-syntax
- Ad hoc vocabulary for each vertical
- Very clear carrot
- Lots of experimentation on UI
- Moderately successful: 10ks of sites
- Scaling issues with vocabulary

Slide 15

Situation in 2010
- Too many choices/decisions for webmasters
  - Divergence in vocabularies
- Too much fragmentation
  - N versions of person, address, …
- A lot of bad/wrong markup
  - ~25% for microformats, ~40% with RDFa
  - Some spam, mostly unintended mistakes
- Absolute adoption numbers still rather low
  - Fewer than 100k sites

Slide 16

Schema.org
- Work started in August 2010
  - Google, Yahoo!, Microsoft & then Yandex
- Goals:
  - One vocabulary understood by all the search engines
  - Make it very easy for the webmaster
- It is A vocabulary. Not The vocabulary.
  - Webmasters can use it together with other vocabs
  - We might not understand the other vocabs. Others might

Slide 17

Schema.org: Major sites
- News: Nytimes, guardian.com, bbc.co.uk
- Movies: imdb, rottentomatoes, movies.com
- Jobs / careers: careerjet.com, monster.com, indeed.com
- People: linkedin.com
- Products: ebay.com, alibaba.com, sears.com, cafepress.com, sulit.com, fotolia.com
- Videos: youtube, dailymotion, frequency.com, vinebox.com
- Medical: cvs.com, drugs.com
- Local: yelp.com, allmenus.com, urbanspoon.com
- Events: wherevent.com, meetup.com, zillow.com, eventful
- Music: last.fm, myspace.com, soundcloud.com

Slide 18

Schema.org principles: Simplicity
- Simple things should be simple
  - For webmasters, not necessarily for consumers of markup
  - Webmasters shouldn’t have to deal with N namespaces
- Complex things should be possible
  - Advanced webmasters should be able to mix and match vocabularies
- Syntax
  - Microdata, usability studies
  - RDFa, JSON-LD, …
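
To make the syntax bullet concrete, here is a minimal sketch of what JSON-LD markup for a single item could look like, generated from Python. The helper names `to_jsonld` and `as_script_tag` are hypothetical, and the sample person is only an illustration; the `@context`, `@type`, and `application/ld+json` conventions are the standard JSON-LD ones.

```python
import json


def to_jsonld(item_type, **properties):
    """Build a schema.org JSON-LD document (as a dict) for one item."""
    doc = {"@context": "https://schema.org", "@type": item_type}
    doc.update(properties)
    return doc


def as_script_tag(doc):
    """Wrap the document in the script element used to embed JSON-LD in HTML."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Illustrative item; the nested Place is also plain schema.org vocabulary.
person = to_jsonld(
    "Person",
    name="Chuck Norris",
    birthPlace={"@type": "Place", "name": "Ryan, Oklahoma"},
)
print(as_script_tag(person))
```

The webmaster only ever touches one namespace here, which is the "simple things should be simple" point in practice.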

Slide 19

Schema.org principles: Simplicity
- Can’t expect webmasters to understand Knowledge Representation, Semantic Web Query Languages, etc.
- It has to fit in with existing workflows
  - A posteriori ‘markup tools’ don’t work
- Avoid KR system driven artifacts
  - Multiple domain / range for attributes
  - No classes like ‘Agent’
  - Categories and attributes should be concrete

Slide 20

Schema.org principles: Simplicity
- Copy and edit as the default mode for authors
  - It is not a linear spec, but a tree of examples
- Vocabularies
  - Authors only need to have a local view
  - But schema.org tries to have a single global coherent vocabulary

Slide 21

Schema.org principles: Incremental
- Started simple
  - ~100 categories at launch
- Applies to every area
  - Add complexity after adoption
  - Now ~1200 vocab items
  - Go back and fill in the blanks
- Move fast, accept mistakes, iterate fast

Slide 22

Schema.org Principles: URIs
- ~1000s of terms like Actor, birthdate
  - ~10s for most sites
  - Common across sites
- ~10ks of terms like USA
  - External enumerations
- ~1b-100b terms like Chuck Norris and Ryan, Oklahoma
  - Cannot expect agreement on these
  - Reference by description
  - Consumers can reconcile entity references
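
"Reference by description" leaves the matching work to the consumer. The following is a rough sketch of how a consumer might reconcile descriptions from different pages, assuming a deliberately naive rule (same type, and equal normalized values on a few shared properties). The function names, the choice of `name` and `birthPlace` as keys, and the sample data are all assumptions for illustration; real reconciliation would be far more involved.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(str(value).lower().split())


def same_entity(a, b, keys=("name", "birthPlace")):
    """Treat two descriptions as the same entity when their types agree and
    every key present in both has the same normalized value."""
    if a.get("@type") != b.get("@type"):
        return False
    shared = [k for k in keys if k in a and k in b]
    return bool(shared) and all(normalize(a[k]) == normalize(b[k]) for k in shared)


def reconcile(descriptions):
    """Greedily group descriptions that appear to refer to the same entity."""
    clusters = []
    for desc in descriptions:
        for cluster in clusters:
            if same_entity(cluster[0], desc):
                cluster.append(desc)
                break
        else:
            clusters.append([desc])
    return clusters


# Hypothetical descriptions as they might appear on three different pages.
pages = [
    {"@type": "Person", "name": "Chuck Norris", "birthPlace": "Ryan, Oklahoma"},
    {"@type": "Person", "name": "chuck  norris", "birthPlace": "Ryan, Oklahoma"},
    {"@type": "Person", "name": "Joe Montana"},
]
clusters = reconcile(pages)
print(len(clusters))
```

No page had to agree on a URI for the person; the two matching descriptions end up in one cluster purely because their properties agree.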

Slide 23

+ = USA

Slide 24

Schema.org Principles: Collaborations
- Most discussions on public W3C lists
- Work closely with interest communities
- Work with others to incorporate their vocabularies
  - We give them attribution on schema.org
  - Webmasters should not have to worry about where each piece of the vocabulary came from
  - Webmasters can mix and match vocabs

Slide 25

Schema.org Principles: Collaborations
- IPTC / NYTimes / Getty with rNews
- Martin Hepp with Good Relations
- US Veterans, Whitehouse, Indeed.com with Job Posting
- Creative Commons with LRMI
- NIH National Library of Medicine for Medical vocab.
- Bibextend, Highwire Press for Bibliographic vocabulary
- Benetech for Accessibility
- BBC, European Broadcasting Union for TV & Radio schema
- Stackexchange, SKOS group for message board
- Lots and lots and lots of individuals

Slide 26

Schema.org Principles: Partners
- Partner with Authoring platforms
  - Drupal, Wordpress, Blogger, YouTube
- Drupal 8
  - Schema.org markup for many types
    - News articles, comments, users, events, …
  - More schema.org types can be created by site author
  - Markup in HTML5 & RDFa Lite
  - Will come out early 2015

Slide 27

Recent Additions
- From Nouns to Verbs: Actions
  - Object → potential actions
  - Constraints on actions
  - E.g., ThorMovie → Stream, Buy, …
- Introducing time: Roles
  - E.g., Joe Montana played for the SF 49ers from 1979 to 1992 in the position QuarterBack
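
The two additions can be sketched as JSON-LD, written here as Python dicts. This is an illustration under assumptions rather than the markup shown in the talk: the role example uses schema.org's `OrganizationRole` with `startDate`, `endDate`, and `roleName`, and mapping the slide's "Stream, Buy" onto `WatchAction` and `BuyAction` is a guess. The helper `role_summary` is hypothetical.

```python
import json

# Role: the relationship itself becomes a node, so it can carry dates and a
# position instead of being a bare team -> person edge.
team = {
    "@context": "https://schema.org",
    "@type": "SportsTeam",
    "name": "San Francisco 49ers",
    "member": {
        "@type": "OrganizationRole",
        "member": {"@type": "Person", "name": "Joe Montana"},
        "startDate": "1979",
        "endDate": "1992",
        "roleName": "Quarterback",
    },
}

# Action: an object advertises what can be done with it.
# Assumption: "Stream" and "Buy" are modeled as WatchAction and BuyAction.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Thor",
    "potentialAction": [{"@type": "WatchAction"}, {"@type": "BuyAction"}],
}


def role_summary(org):
    """Read the role node back into a one-line description."""
    role = org["member"]
    return "{}: {}, {}, {}-{}".format(
        role["member"]["name"],
        role["roleName"],
        org["name"],
        role["startDate"],
        role["endDate"],
    )


print(role_summary(team))
print(json.dumps(movie, indent=2))
```

Note how the inner `member` repeats the outer property name: the role node sits in the middle of the edge it qualifies.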

Slide 28

Recent Additions
- Scholarly work, Comics, Serials, …
- Communications: TV, Radio, Q&A, …
- Accessibility
- Commerce: Reservations
- Sports
- Buyer/Seller, etc.
- Bibtex
- The ontology is growing …
  - ~800 properties
  - ~600 classes

Slide 29

Looking forward
- Schema.org is doing better than we expected
  - Thanks to millions of webmasters!
- But this is not the final goal
  - Just the means to the next generation of applications
- First generation of applications
  - Rich presentation of search results
- Many new applications
  - Related to search and beyond

Slide 30

Newer Applications: Knowledge Graph

Slide 31

Newer Applications: Knowledge Graph

Slide 32

Non search applications: Google Now
User profile (google.com/now/topics) + structured data feeds

Slide 33

Pinterest: Schema.org for Rich Pins

Slide 34

Reservations → Personal Assistant
Open Table website → confirmation email → Android Reminder

Slide 35

Vertical Search
- Structured data in search
  - Web search: annotate search results
  - OR
  - Filtering based on structured data
    - Only in specialized corpus
    - Ecommerce, real estate, etc.
- How about filtering based on structured data across the web?
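
Filtering on structured data amounts to selecting items by type and property value once the markup has been extracted. The following is a toy sketch over an in-memory list, assuming the items have already been parsed into dicts; the function names, the example.com URLs, and the sample items are hypothetical, and the use of `specialCommitments` with the value `VeteranCommit` on `JobPosting` is my reading of the schema.org vocabulary rather than something stated in the talk.

```python
def matches(item, item_type, constraints):
    """True when the item has the requested type and every constrained
    property equals the requested value."""
    if item.get("@type") != item_type:
        return False
    return all(item.get(key) == value for key, value in constraints.items())


def vertical_search(items, item_type, **constraints):
    """Return the items of one schema.org type that satisfy all constraints."""
    return [item for item in items if matches(item, item_type, constraints)]


# Hypothetical items, as if extracted from markup on three crawled pages.
ITEMS = [
    {"@type": "JobPosting", "title": "Warehouse lead",
     "url": "https://example.com/jobs/1", "specialCommitments": "VeteranCommit"},
    {"@type": "JobPosting", "title": "Line cook",
     "url": "https://example.com/jobs/2"},
    {"@type": "Recipe", "name": "Chili",
     "url": "https://example.com/recipes/1"},
]

hits = vertical_search(ITEMS, "JobPosting", specialCommitments="VeteranCommit")
print([hit["url"] for hit in hits])
```

A web-scale version replaces the list with a crawl index, but the query shape (a type plus property restrictions) stays the same.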

Slide 36

Google Rich Snippets: Recipe View

Slide 37

Web scale vertical search Searching for Veteran friendly jobs

Slide 38

Web Scale custom vertical search
- Build your own custom vertical search engine
  - Google does the heavy lifting: crawling, indexing, etc.
  - You specify the schema.org restricts
  - APIs to help build your own UI
- Searches over all pages on the web with a certain schema.org markup
- Demo

Slide 39

Scientific Data Publishing
- US Govt alone spends over $60B/yr on scientific research
- Primary output of most of this research is data
- Most of the data is thrown away
  - All that is published are papers
- We would like the data published in an easily reusable form

Slide 40

Case study: Clinical Trials
- Clinical trials
  - 4000+ clinical trials at any time in the US alone
  - Almost all the data ‘thrown away’
  - All that gets published is a textual ‘abstract’
- Many of the trials are redundant
  - Earlier trials have the data
- Assumptions, etc. cannot be re-examined
- Longitudinal studies extremely hard, but super important
- Having all the clinical trial data on the web, in a common schema will make this much easier!

Slide 41

Case study: SkyServer
- Huge amount of astronomy data
- Jim Gray, NASA and others brought it all together, normalized it and made it available on the web
- Has changed the way astronomy research takes place
  - Students in Africa getting PhDs without leaving Africa!
  - Radio/Ultra-violet/Visible light data easily brought together
- Caveats
  - SQL biased, not distributed, not scalable
  - All normalization done by hand, once
  - Small number of data sources
- But shows that it can be done …

Slide 42

First steps for scientific data publication
- OSTP directive for data from federally funded research to be freely available
- Formation of new ‘Data Science’ institute inside NIH
- Seeing traction in scientific data on the web
  - Lots of interest in creating schemas
  - Public repositories for scientific data starting

Slide 43

Concluding
- Structured data on the web is now ‘web scale’
- Schema.org has got traction and is evolving
- The most interesting applications are yet to come

Slide 44

Questions?