Rosettepedia: A Script for Enhancing Entity Extraction


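Rosette can extract 18 types of entities from text data. We have just published Rosettepedia on GitHub, a script that builds on this entity extraction capability and returns information about each extracted entity at the same time.
Rosettepedia uses Rosette's entity extraction and entity linking capabilities to extract the entities in a text and output them together with related information from Wikipedia.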
—————-

Introducing: Rosettepedia

June 13, 2017

A text analytics recipe for entity extraction enhancement

The Rosette API team is always hard at work devising ways for our users to get more value from their unstructured text data. Last month we published a recipe on our community GitHub that combined multiple Rosette endpoints to produce document summaries. This month, we’re thrilled to announce “Rosettepedia,” a new recipe that gives users instant access to a wealth of additional information about the entities in their text data.

Rosette’s entity extraction endpoint recognizes and extracts 18 different entity types within your text, but what if Rosette extracts an entity you’re not familiar with yet? Or an entity you recognize but don’t know very much about? The Rosettepedia recipe allows you to enhance your entity extraction results with information from Wikipedia Infoboxes and Wikidata drawn from the MediaWiki API.

How it works

The Rosettepedia script calls Rosette API’s entity extraction and entity linking capabilities, connects to publicly available Wikidata entries, and automatically returns any relevant information along with the extracted entities, enriching your results while saving you time and effort.

Each entity in Wikidata has an identifier—a “QID”—that uniquely identifies it. The /entities endpoint of Rosette API can resolve or link mentions of entities by assigning the appropriate identifier. For example, Washington (Q1223) refers to the state in the United States, Washington (Q61) refers to the city in the District of Columbia, and Washington (Q23) refers to the first president of the United States. This recipe demonstrates how to look up an entity by its QID, provided by Rosette API, and access additional information provided by the Wikidata knowledge base.
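
As a rough illustration of that lookup step, the sketch below resolves a QID against Wikidata's public wbgetentities API action using only the Python standard library. It is not the code from the Rosettepedia script: the function names, the selected props, and the User-Agent string are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_wbgetentities_url(qid, language="en"):
    """Build a wbgetentities request URL for a single QID."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions|sitelinks",
        "languages": language,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def summarize_entity(response, qid, language="en"):
    """Pick the label, description and Wikipedia title out of a response."""
    entity = response["entities"][qid]
    label = entity.get("labels", {}).get(language, {}).get("value")
    description = entity.get("descriptions", {}).get(language, {}).get("value")
    title = entity.get("sitelinks", {}).get(language + "wiki", {}).get("title")
    return {
        "entityId": qid,
        "label": label,
        "description": description,
        "title": title,
    }


def fetch_entity(qid, language="en"):
    """Call the live API (requires network access) and summarize the result."""
    request = urllib.request.Request(
        build_wbgetentities_url(qid, language),
        # Hypothetical User-Agent value; pick one that identifies your client.
        headers={"User-Agent": "rosettepedia-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return summarize_entity(json.load(response), qid, language)
```

Calling fetch_entity("Q23") would then return the label, description, and English Wikipedia title for the entity that Rosette API linked to the first president of the United States.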

At the time of writing, Rosette API’s entity linking functionality supports four languages: Chinese, English, Japanese and Spanish.

Rosettepedia in action

The simplest way to use the script is to pipe in a string:

$ echo "OPEC will meet in Vienna this week." | ./rosettepedia.py -w eng > opec.json
Extracting entities via Rosette API ...
Done!
Augmenting entities via MediaWiki API ...
fetching "en" Infobox/Wikidata for entity: Q7795 (OPEC) ...
fetching "en" Infobox/Wikidata for entity: Q1741 (Vienna) ...
Done!

The script returns the following results:

[
  {
    "type": "ORGANIZATION",
    "mention": "OPEC",
    "normalized": "OPEC",
    "count": 1,
    "entityId": "Q7795",
    "wikipedia": {
      "infobox": {
        "name": "Organization of the Petroleum Exporting Countries",
        "image_flag": "Flag of OPEC.svg",
        "image_map": "OPEC.svg",
        "org_type": "International cartel",
        "membership_type": "Membership",
        "admin_center_type": "Headquarters",
        "admin_center": "Vienna, Austria",
        "languages_type": "Official language",
        "languages": "English",
        "leader_title1": "Secretary General",
        "leader_name1": "Mohammed Barkindo",
        "established": "Baghdad, Iraq",
        "established_event1": "Statute",
        "established_date1": "September 1960",
        "established_event2": "In effect",
        "established_date2": "January 1961",
        "currency": "(US$ /bbl)"
      },
      "wikidata": {
        "website": "http://www.opec.org",
        "image": "OPEC-building-01.jpg",
        "instance": "international organization",
        "category": "Category:OPEC"
      },
      "title": "OPEC",
      "url": "https://en.wikipedia.org/wiki/OPEC"
    }
  },
  {
    "type": "LOCATION",
    "mention": "Vienna",
    "normalized": "Vienna",
    "count": 1,
    "entityId": "Q1741",
    "wikipedia": {
      "infobox": {
        "name": "Vienna",
        "native_name": "Wien",
        "settlement_type": "Capital city",
        "image_flag": "Flag of Wien.svg",
        "image_seal": "Vienna seal 1926.svg",
        "image_shield": "Wien 3 Wappen.svg",
        "shield_size": "80px",
        "image_map": "Wien in Austria.svg",
        "map_caption": "Location of Vienna in Austria",
        "subdivision_type": "Country",
        "subdivision_name": "Austria",
        "leader_party": "SPÖ",
        "leader_title": "Mayor and Governor",
        "leader_name": "Michael Häupl",
        "leader_title1": "Vice-Mayors and Vice-Governors",
        "area_magnitude": "2 chaiz",
        "area_total_km2": "414.65",
        "area_land_km2": "395.26",
        "area_water_km2": "19.39",
        "elevation_m": "151 (Lobau) – 542 (Hermannskogel)",
        "elevation_ft": "495–1778",
        "population_total": "1,867,960",
        "population_as_of": "1. January 2017",
        "population_density_km2": "4326.1",
        "population_metro": "2,600,000",
        "population_blank2_title": "Ethnicity",
        "population_blank2": "61.2% Austrian38.8% Other",
        "population_demonym": "Viennese, Wiener",
        "population_note": "Statistik Austria, VCÖ – Mobilität mit Zukunft",
        "postal_code_type": "Postal code",
        "postal_code": "1010–1423, 1600, 1601, 1810, 1901",
        "website": "www.wien.gv.at",
        "footnotes": "frameless|x30px",
        "blank1_name": "- GDP total (2014)http://ec.europa.eu/eurostat/documents/2995521/7192292/1-26022016-AP-EN.pdf/602b34e8-abba-439e-b555-4c3cb1dbbe6e",
        "blank1_info": "€82 billion/ US$110 billion",
        "blank2_name": "- GDP per capita(2014)http://ec.europa.eu/eurostat/documents/2995521/7192292/1-26022016-AP-EN.pdf/602b34e8-abba-439e-b555-4c3cb1dbbe6e",
        "blank2_info": "€47,300/ US$63,000XE.com average GBP/ USD ex. rate in 2014",
        "timezone": "CET",
        "utc_offset": "+1",
        "timezone_DST": "CEST",
        "utc_offset_DST": "+2",
        "blank_name": "Vehicle registration",
        "blank_info": "W"
      },
      "wikidata": {
        "image": "Collage von Wien.jpg",
        "coordinates": {
          "latitude": 48.20833,
          "longitude": 16.373064,
          "altitude": null,
          "precision": 1e-06,
          "globe": "http://www.wikidata.org/entity/Q2"
        },
        "website": "https://www.wien.gv.at/",
        "instance": [
          "city",
          "capital",
          "city with millions of inhabitants",
          "federal capital",
          "municipality of Austria",
          "place with town rights and privileges",
          "statuatory city of Austria",
          "state of Austria",
          "district of Austria",
          "metropolis",
          "tourist destination"
        ],
        "country": [
          "Austria",
          "First Republic of Austria",
          "Austria-Hungary",
          "Republic of German-Austria",
          "Austrian Empire",
          "Federal State of Austria",
          "Nazi Germany",
          "Habsburg Empire",
          "Archduchy of Austria",
          "Duchy of Austria",
          "March of Austria",
          "Duchy of Bavaria",
          "Allied-occupied Austria"
        ],
        "category": "Category:Vienna"
      },
      "title": "Vienna",
      "url": "https://en.wikipedia.org/wiki/Vienna"
    }
  }
]

As you can see, the Rosettepedia script returns detailed results for OPEC and Vienna, augmenting the attributes that Rosette API normally returns (such as the entity type, mention, count, and QID) with an additional "wikipedia" attribute that contains the infobox data, the Wikidata properties, and the title and URL of the matching Wikipedia page.
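
The shape of that result can be sketched as a simple merge. The helper below is hypothetical (augment_entity is a name invented for this example, not a function from the script); it only shows how the extra attribute sits alongside the fields Rosette API returns, using values from the OPEC result above.

```python
def augment_entity(entity, infobox, wikidata, title, url):
    """Return a copy of a Rosette entity with a "wikipedia" attribute attached."""
    augmented = dict(entity)  # shallow copy, so the original entity is untouched
    augmented["wikipedia"] = {
        "infobox": infobox,
        "wikidata": wikidata,
        "title": title,
        "url": url,
    }
    return augmented


# Fields as returned for the OPEC mention in the example output.
base = {
    "type": "ORGANIZATION",
    "mention": "OPEC",
    "normalized": "OPEC",
    "count": 1,
    "entityId": "Q7795",
}

enriched = augment_entity(
    base,
    infobox={"admin_center": "Vienna, Austria"},
    wikidata={"website": "http://www.opec.org"},
    title="OPEC",
    url="https://en.wikipedia.org/wiki/OPEC",
)
```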

Another way to use the script is to have Rosette API extract content from a web page by supplying a URL and using the -u/--content-uri option:

$ ./rosettepedia.py -u -i 'https://ja.wikipedia.org/wiki/アメリカスカップ' -w jpn > アメリカスカップ.json
Extracting entities via Rosette API ...
...
Done!
$ jq '.entities[]|select(.entityId == "Q29")' アメリカスカップ.json
{
  "type": "LOCATION",
  "mention": "Español",
  "normalized": "Español",
  "count": 1,
  "entityId": "Q29",
  "wikipedia": {
    "infobox": {},
    "wikidata": {
      "coordinates": {
        "latitude": 40,
        "longitude": -3,
        "altitude": null,
        "precision": 1,
        "globe": "http://www.wikidata.org/entity/Q2"
      },
      "image": "Relief Map of Spain.png",
      "continent": [
        "ヨーロッパ",
        "アフリカ"
      ],
      "instance": [
        "主権国家",
        "国",
        "欧州連合加盟国",
        "国際連合加盟国",
        "欧州評議会加盟国"
      ],
      "category": "Category:スペイン",
      "country": "スペイン"
    },
    "title": "スペイン",
    "url": "https://ja.wikipedia.org/wiki/スペイン"
  }
}

Given the additional information provided by the Wikipedia extended attributes, you can filter down to only those entities that satisfy certain properties. For instance, you can query for only those entities that have geo-coordinates:

$ jq '.entities[]|select(.wikipedia.wikidata|has("coordinates"))' アメリカスカップ.json
...
{
  "type": "LOCATION",
  "mention": "JPN",
  "normalized": "JPN",
  "count": 1,
  "entityId": "Q17",
  "wikipedia": {
    "infobox": {},
    "wikidata": {
      "coordinates": {
        "latitude": 35,
        "longitude": 136,
        "altitude": null,
        "precision": 1,
        "globe": "http://www.wikidata.org/entity/Q2"
      },
      "instance": [
        "主権国家",
        "国",
        "島国",
        "国際連合加盟国"
      ],
      "continent": "アジア",
      "category": "Category:日本",
      "country": "日本"
    },
    "title": "日本",
    "url": "https://ja.wikipedia.org/wiki/日本"
  }
}
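
The same filter can be written in Python when jq is not at hand. This is a sketch under the assumption that the output file is either a JSON object with an "entities" list (as the jq queries above imply) or a bare list of entities (as in the first example); the helper names are illustrative and do not come from the script.

```python
import json


def entities_from(data):
    """Accept either {"entities": [...]} or a bare list of entities."""
    if isinstance(data, dict):
        return data.get("entities", [])
    return data


def with_coordinates(entities):
    """Keep only the entities whose Wikidata block has geo-coordinates."""
    selected = []
    for entity in entities:
        wikidata = entity.get("wikipedia", {}).get("wikidata", {})
        if "coordinates" in wikidata:
            selected.append(entity)
    return selected


def load_entities(path):
    """Read a Rosettepedia output file and return its list of entities."""
    with open(path, encoding="utf-8") as handle:
        return entities_from(json.load(handle))
```

For the file produced above, with_coordinates(load_entities("アメリカスカップ.json")) would select the same entities as the jq query. Unlike the jq expression, this version skips entities that have no "wikipedia" attribute instead of raising an error.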

 

Try it yourself

With access to Rosettepedia, you can now extract information about the entities in your text data, not just the entities themselves. Speed up research projects and enhance intelligence analysts’ reports with public data. Have your own knowledge base of customer information or persons of interest? Talk to our customer engineering team about on-premise customization opportunities.

Ready to get started? First, sign up for a free API key (no credit card required) for up to 10,000 calls per month. Next, visit our Community GitHub for step-by-step instructions on installing and running the script.

Thought of another way to combine Rosette API endpoints for a new use case? Let us know and we’ll feature you on our blog!