Researcher Profile Extraction Spec
First draft: by Jie Tang
May 29, 2007
The data set and related documents are used for researcher profile extraction, also called researcher profiling (Tang et al., 2007; Tang et al., 2008).
The related data are as follows.
898_data: dataset used in the paper, without URL information (898 files)
Notice: we also annotate block information. See the specification below.
898_data_url: dataset annotated with URL information (898 files)
all_data: all annotated data, with URL and image information.
Img_898_data: images of 898_data
Img_all_data: images of all_data
id2name.xml: a list of id and person_name pairs; id is the filename in the dataset
Representative publications:
Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008).
Jie Tang, Duo Zhang, and Limin Yao. Social Network Extraction of Academic Researchers. In Proceedings of 2007 IEEE International Conference on Data Mining (ICDM’2007). pp. 292-301.
Below, we give the specification for the researcher profiling task.
Spec on Researcher Profile Extraction
We are developing extraction tools in ArnetMiner, a researcher social network system. The tools will be used to extract researcher profiles from Web pages and output the extracted information into a researcher database.
This documentation describes the specs of the researcher profile that needs to be extracted when building the researcher network.
All the specs here focus on English.
Generally speaking, a researcher profile is defined as shown in Table 1.
Table 1. Researcher profile annotation subtasks
Section | Annotation Subtasks
Basic Information | Person Photo
 | Position
 | Affiliation
 | Research Interest
 | Homepage
Contact Information | Email Address
 | Postal Address
 | Phone
 | Fax
Educational History | University, Major, Date, and Advisor for Ph.D. Degree
 | University, Major, Date, and Advisor for M.S. Degree
 | University, Major, Date, and Advisor for B.S. Degree
Publication | Authored Papers and Technical Reports
Relationship | Co-authorship
 | Work-in-the-same-place
 | PC-Member-in-the-same-conference
Each component in the right column of Table 1 is defined as a property of a researcher and may consist of words (standard words like “professor” and non-standard words like URLs and email addresses).
Although it is possible, we do not annotate embedded (nested) annotations. For example, we prefer
“[affiliation]Department of Computer Science [address] CANADA V8W 3P6[/address]”
instead of
“[address] [affiliation]Department of Computer Science CANADA V8W 3P6[/address]”
● About tokenization
We should note that annotated tags should not destroy the structure of the HTML. For example, keep <b></b> as it is. Likewise for <font size=3></font> and <a href=""></a>. In particular, font and bold tags can be nested, and the annotation should preserve the nesting as it is.
■ Words are separated by word breakers or sentence breakers.
■ Breakers are usually not included in a word when they work as separators. E.g., “(student)” → “student”, “A#B” → “A”, “B”, “time/space” → “time”, “space”. But some breakers are viewed as a separate ‘word’, e.g., “(”, “,”, “.”, “*”, and a line break.
■ The HTML tag “<Image>” is tokenized as a single word, e.g., “<IMAGE src="defaul3.jpg" alt=""/>”.
■ Words can be connected by hyphen and underscore symbols: “pre-condition”, “necessary_condition”.
■ Non-standard words are defined as below.
◆ email_body, like “a.b@163.com”.
◆ email_pre, like “Email:”.
◆ Fax_body, like “010
◆ Fax_pre, like “fax number:”.
◆ Phone_pre, like “phone number:”.
◆ Position_body, like “Assistant Professor”.
◆ URL, like “http://keg.cs.tsinghua.edu.cn/”.
◆ Words can be acronyms, like I.B.M., although they may contain the breaker “.”.
◆ “,” can be considered part of a word when the word is a number (e.g. 12,000).
◆ Some special symbols can be parts of proper nouns, e.g. “C#”, “Yahoo!”, “P&G”, “P2P” are words.
◆ Words can be machine-generated strings containing breakers (e.g., “4#790174ajaj”).
● Person Photo
■ A person photo is a picture denoted by an “<Image>” tag in HTML. When extracting information from the Web, we clean out all other tags except the “<Image>” tag. We also download the picture from the Web so that we can define content features for the picture by analyzing its colors and size.
■ A person photo should at least contain the face of the current researcher.
■ A person photo can also contain the faces of several persons, including the current researcher.
■ A person photo can be black and white.
● Position
■ Position represents the current position of the current researcher, not past ones. For example, in “He was a professor”, “was” indicates a former position, so it should not be annotated as position.
■ Position is not the degree title of the current researcher, for example “Dr.” or “Ph.D.”
■ A position can be “Assistant Professor” or “Co-chair”.
■ A position should only contain the title of a researcher, without information about his or her research area or department. In the case of “Professor of Computer Science at Texas A&M University.”, we only annotate “Professor” as position and annotate “Computer Science at Texas A&M University.” as affiliation. Another example: “and is now [position]Professor[/position] of Astronomy and [position]Director[/position] of Graduate Studies at [affiliation]
■ A researcher may have more than one position. E.g., one can be the head of a company research group and also an adjunct professor at a university.
● Affiliation
■ Affiliation likewise represents the current affiliation, not past ones.
■ An affiliation inside an address should not be annotated. E.g., in “[address]CSE Building Room E301 University of Florida - P.O. Box 116120 Gainesville, FL 32611-6120 USA[/address]”, the text “University of Florida” should not be annotated as affiliation.
■ Text preceded by “Mail:”, “Address:”, “Office:”, or “Contact information:” should be annotated as address although it might look like an affiliation. E.g. “Department of Computer Science,
■ However, when an organization co-occurs with an address like “[affiliation]
■ A researcher may have more than one affiliation.
■ When the text is like this: “I am a [position]Phd student[/position] at the [affiliation]Computer Science Dept. of Tel Aviv University[/affiliation]”, we annotate the information as affiliation instead of phdmajor and phduniv.
■ Annotate “a [position]member[/position] of [affiliation]IEEE, ACM…[/affiliation]”, for example:
I am currently serving on the [position]program committees[/position] of [affiliation]AAMAS 2007, AMEC 2007, ACM EC 2007, and AAAI 2007[/affiliation].
● Postal Address
■ Address should be a postal mailing address as it appears on the page, for example room number, building, and road.
■ Sometimes, a researcher may have an office address in addition to his/her contact address.
Special case: [address]R. Dr. Xavier Sigaud 150, Urca</font>
<font size=2> 22290-180
<font size=2>
<font size=2>Room : [address]3rd Floor, CAT[/address] 39084.txt
● Research Interest (do not annotate this tag or block)
■ The names of the research topics that the researcher is interested in. They may appear in the introduction part, in the form of natural-language sentences.
For example, the simple format:
[interests]Machine learning and pattern recognition[/interests]
[interests]Computer vision[/interests], [interests]speech recognition[/interests]
[interests]Programming language system development[/interests], Lush (look up the dictionary, it means thriving)
In the example above, we consider “,” as a separator, indicating that each sub-sentence should have its own tag.
■ The subtitles of research projects should be annotated as interests.
Research Projects
* <b>[interests]Schema Mapping Generation[/interests] (Clio).</b>
■ Interests contained in natural-language sentences: terms that are likely to be topics or subtopics of an area should be annotated as interests. (pay attention to this spec)
I'm an [position]associate professor[/position] at [affiliation]the
● Homepage
■ Not available
■ We do not intend to annotate the homepage URL in the web page, because we assume that the web page found on the Web is itself the homepage of the current researcher.
● Email Address (pay attention to this: all formats of emails)
■ One can have more than one email address.
■ An email address can be represented in diverse forms. Some examples are below.
◆ hangli at microsoft dot com
◆ cs.duke.edu, junyang
◆ ASJMZheng@ntu.edu.sg
◆ erafalin(at)cs.tufts.edu
■ Some email addresses might be represented as a picture. For example, “my email address: <Image src=’email.jpg’/>”. It should be annotated.
■ Email like this example should also be annotated: “e-mail: [email]Natalio.Krasnogor -replace all this by at simbol- nottingham.ac.uk[/email]”
■ Email like this: “wmt then the at-sign then uci dot edu” should be annotated.
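As a rough illustration of the email forms listed above, the following Python sketch rewrites a few common obfuscations into a plain address. It is not part of the annotation tool: the function name `normalize_email` and both patterns are assumptions made for this example, and forms such as “cs.duke.edu, junyang”, sentence-like descriptions, and picture emails are deliberately left unresolved (the function returns None for them).

```python
import re

# Plain address such as "ASJMZheng@ntu.edu.sg" (a loose check, assumed for this sketch).
PLAIN_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")


def normalize_email(text):
    """Return a plain address for a few common obfuscations, else None."""
    s = text.strip()
    # "(at)", "[at]" or " at " stand for "@".
    s = re.sub(r"\s*[\(\[]\s*at\s*[\)\]]\s*|\s+at\s+", "@", s, flags=re.IGNORECASE)
    # "(dot)", "[dot]" or " dot " stand for ".".
    s = re.sub(r"\s*[\(\[]\s*dot\s*[\)\]]\s*|\s+dot\s+", ".", s, flags=re.IGNORECASE)
    return s if PLAIN_EMAIL.match(s) else None
```

Strings that the sketch cannot normalize still have to be annotated by hand according to the rules above.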
● Phone
■ A phone number can be a cell number, office phone number, home phone number, or even the phone number of the current researcher's secretary.
■ A researcher can have more than one phone number.
■ A phone number can be a long one (including country code, area code, and extension code) or a short one (including only part of the phone number “88788
● Fax
■ The format of the fax number is exactly the same as that of the phone number.
■ One can use the preceding text to disambiguate the fax number from the phone number.
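The idea of using the preceding text to tell a fax number from a phone number can be sketched as follows. This is only an illustration under stated assumptions: the cue patterns are built from the prefixes that appear in this spec (“Fax:”, “Fax.”, “fax number:”, “Tel:”, “Telephone:”, “phone number:”), the number pattern is a loose guess, and the function name `tag_number` is hypothetical. The fax cue is checked first, and only the number directly following the first recognized cue is tagged.

```python
import re

# Prefix cues taken from the examples in this spec; the lists are not exhaustive.
FAX_PRE = re.compile(r"\bfax(?:\s+number)?\s*[:.]?\s*", re.IGNORECASE)
PHONE_PRE = re.compile(r"\b(?:tel(?:ephone)?|phone(?:\s+number)?)\s*[:.]?\s*", re.IGNORECASE)
# A loose number pattern: digits with optional "+", parentheses, spaces, dots, hyphens.
NUMBER = re.compile(r"\+?\(?\d[\d\s().-]{5,}\d")


def tag_number(line):
    """Wrap the number that follows a fax or phone cue in the matching tag."""
    for pre, tag in ((FAX_PRE, "fax"), (PHONE_PRE, "phone")):
        m = pre.search(line)
        if m:
            # The number must start right after the cue.
            num = NUMBER.match(line, m.end())
            if num:
                return (line[:num.start()] + f"[{tag}]" + num.group(0)
                        + f"[/{tag}]" + line[num.end():])
    return line  # no recognizable cue: leave the line unannotated
```

A bare number without any cue is returned unchanged, since the format alone does not distinguish fax from phone.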
● Educational History
■ (PhD/MS/BS) University
◆ In the case of “I received my MS and PhD in [phdmajor]Computer Science[/phdmajor] from the [phduniv]
◆ In the case of “
◆ 40923: information of one individual, two different affiliations.
◆ BS, Dept. of [bsmajor]Civil
■ (PhD/MS/BS) Date
◆ The date means when the researcher obtained his/her corresponding degree.
◆ In general, the date is represented only as a year. However, in some cases, it can include the month “September
◆ In the case of “From xx to [date]xx[/date]”, we only annotate the latter “xx”. The same holds for 2002.7-[date]2006.7[/date].
◆ If “expected” appears, do not annotate the date.
■ (PhD/MS/BS) Major: “robotics” or other subareas of a large area can be annotated as PhD/MS/BS Major.
■ When a subarea co-occurs with a large area, we annotate both. For example:
PhD (supervisor: Professor K. Glover), [phddate]July 1998[/phddate], [phdmajor]Control[/phdmajor] Group, Department of [phdmajor]Engineering[/phdmajor]
■ (PhD/MS/BS) Advisor: we can add a tag in the data_tag tool (not available).
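The date rules above (annotate only the later date of a range, and skip dates marked “expected”) can be sketched in Python. This is a simplified illustration, not the annotation tool: it assumes dates are written as a four-digit year optionally followed by “.month”, as in “2002.7”, it does not handle month names such as “July 1998”, and the function name `annotate_degree_date` and its `degree` parameter are hypothetical.

```python
import re

# A year such as "1998", optionally followed by ".month" as in "2006.7".
YEAR = re.compile(r"\b(?:19|20)\d{2}(?:\.\d{1,2})?\b")


def annotate_degree_date(text, degree="phd"):
    """Tag only the last date in the text; skip text marked as 'expected'."""
    if re.search(r"\bexpected\b", text, re.IGNORECASE):
        return text
    dates = list(YEAR.finditer(text))
    if not dates:
        return text
    last = dates[-1]
    tag = degree + "date"  # phddate, msdate, or bsdate
    return f"{text[:last.start()]}[{tag}]{last.group(0)}[/{tag}]{text[last.end():]}"
```

For a range such as “2002.7-2006.7” only “2006.7” is wrapped, matching the rule that the start date is left unannotated.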
● Publications
■ Not available
■ So far, we do not consider annotating the authored papers and technical reports from the researchers’ web pages.
● Relationships
■ So far, we do not consider annotating relationships from one’s web pages either.
One can conduct experiments by viewing each line in the web page as a unit. The line breaker is naturally the line break between two lines.
An HTML tag is a single word.
A return is a word.
For the others, the breaker is defined as:
(?:[\n\s\,\?\!\(\)\[\]\{\}]{1})|(?:\.\s+)|(?:\s*$)
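The line-based tokenization described above can be sketched in Python. This is a minimal illustration rather than the tokenizer used in ArnetMiner. It assumes that the first alternative of the breaker pattern is a non-capturing group `(?:...)` like the other two, it keeps every HTML tag and every line break as its own token, and it drops all other breakers. The names `tokenize_line` and `tokenize_page` are hypothetical.

```python
import re

# Breaker pattern from the spec; the leading "(?:" of the first
# alternative is an assumption made so that the pattern compiles.
BREAKER = re.compile(r"(?:[\n\s\,\?\!\(\)\[\]\{\}]{1})|(?:\.\s+)|(?:\s*$)")

# An HTML tag is kept whole as a single token.
HTML_TAG = re.compile(r"<[^>]+>")


def tokenize_line(line):
    """Split one line into tokens: HTML tags stay whole, breakers are dropped."""
    tokens = []
    pos = 0
    for tag in HTML_TAG.finditer(line):
        # Text before the tag is split on breakers.
        tokens.extend(t for t in BREAKER.split(line[pos:tag.start()]) if t)
        tokens.append(tag.group(0))
        pos = tag.end()
    tokens.extend(t for t in BREAKER.split(line[pos:]) if t)
    return tokens


def tokenize_page(text):
    """Treat each line as a unit; a line break ("return") becomes its own token."""
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        tokens.extend(tokenize_line(line))
        if i < len(lines) - 1:
            tokens.append("\n")
    return tokens
```

Note that a period inside a string such as “a.b@163.com” is not a breaker here, because the pattern only splits on a period that is followed by whitespace.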
Examples
<IMAGE src="lucian.jpg" alt="Recent photo of Lucian"/>
<IMAGE src="index_files/image002.jpg" alt=""/>
<IMAGE src="myself.JPG" alt=""/>
<IMAGE src="/presspass/images/gallery/execs/thumbnails/lee-2.jpg" alt=""/>
<IMAGE src="/presspass/images/exec/bio_lee.jpg" alt=""/>

Pri1 | Pri2 | Pri3
Professor | corporate vice president | Member
Director | Lead Researcher and Research Manager | Co-Chair
Associate Professor | Director of Research |
Assistant Professor | Project Leader |

Pri1 | Pri2 | Pri3
jerryne@microsoft.com | mailto:jerryne@microsoft.com | hangli at microsoft dot com
aaa@verw.uni-koeln.de | | cs.duke.edu, junyang
 | | <Image src=“email.jpg”>
 | | mshong [at] cs.cornell.edu

Pri1
[position]Professor[/position] of Applied Mathematics and Computer Science [affiliation]Department of Computer Science [address]
[affiliation] Courant Institute[/affiliation] [address]
As [position]corporate vice president[/position] of the [affiliation]Natural Interactive Services Division (NISD) at Microsoft Corp.[/affiliation]

Pri1
[affiliation] Courant Institute[/affiliation] [address]
[address]Microsoft Research
Richard Segal [affiliation] [address]

Pri1 | Pri2
(86-10)58963177 | 919-660-6587
(+41 22) 379 58 85 |
Tel: [phone](86-10)58963177[/phone] |
Telephone: [phone]203.432.6432[/phone] |

Pri1 | Pri2 | Pri3
882-8080 | 011-425-882-8080 | 425) 882-8080
(425) 882-8080 | x.4870, ext.4870 | 425 882 8080
1 (425) 882-8080 | 882.8080, 425.882.8080 |
+1 (425) 882-8080 | |

Pri1 | Pri2
Fax: [fax](86-10)88097306[/fax] |
Fax. [fax]+82-62-970-2004[/fax] |

Pri1
I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from [phduniv]the University of
I received my [msdegree]M.S.[/msdegree], [phddegree]Ph.D.[/phddegree] from [phduniv]
<b>Educational Background:</b> [phddegree]Ph.D.[/phddegree] [phdmajor]Physics[/phdmajor], [phduniv]USC[/phduniv], Los Angeles, U.S.A.

To construct an evaluation data set that conforms to the spec above and can easily adapt to potential spec changes, we define the following tags. (These are beginning tags; the corresponding end tags are [/tagname], e.g., [/address]. We use square brackets rather than “<>” to differentiate them from the HTML tags in the web pages.)
1. [address]
2. [affiliation]
3. [bsdate]
4. [bsmajor]
5. [bsuniv]
6. [email]
7. [fax]
8. [msdate]
9. [msmajor]
10. [msuniv]
11. [phddate]
12. [phdmajor]
13. [phduniv]
14. [phone]
15. [position]
For the annotation, we may add more tags, for example to annotate professional history.
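To show how data annotated with these bracket tags might be read back, here is a small Python sketch that collects (tag, text) pairs for the 15 tags listed above. It is an illustration only: the function name `extract_fields` is hypothetical, and the pattern assumes that every beginning tag has a matching end tag with the same name and that tags are not nested.

```python
import re

# The 15 tag names defined above.
TAGS = ("address", "affiliation", "bsdate", "bsmajor", "bsuniv", "email", "fax",
        "msdate", "msmajor", "msuniv", "phddate", "phdmajor", "phduniv",
        "phone", "position")

# [tag]...[/tag] with the same name on both sides; non-greedy, may span lines.
TAG_SPAN = re.compile(r"\[(%s)\](.*?)\[/\1\]" % "|".join(TAGS), re.DOTALL)


def extract_fields(annotated):
    """Return (tag, text) pairs in document order, whitespace-normalized."""
    return [(m.group(1), " ".join(m.group(2).split()))
            for m in TAG_SPAN.finditer(annotated)]
```

Because the tag names are kept in one tuple, adding a new tag (for example for professional history) only requires extending `TAGS`.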
Specs of Block Annotation
Blocks are a larger unit than the tags for the basic information of a researcher. A block should include some useful information about a researcher; in other words, it groups a subset of the tags defined in the previous section. For example, the “introduction” block can include tags such as “position”, “affiliation”, “phdmajor”, and so on. Our purpose is to build a hierarchical structure of all the labels. The tags in each block can be considered children of the block; conversely, the block is the parent of its children. Note that a passage which contains only one label should not be annotated as a block.
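The parent and child relation between blocks and tags can be sketched in Python as follows. This is a rough illustration, not the annotation tool: the function name `build_hierarchy` is hypothetical, only the blocks that carry child tags in this spec (contactinfo, introduction, education) are handled, and the "valid" flag reflects one reading of the rule above, namely that a block needs at least two labels.

```python
import re

# Blocks that contain child tags; other block names are left out of this sketch.
BLOCKS = ("contactinfo", "introduction", "education")
BLOCK_SPAN = re.compile(r"\[(%s)\](.*?)\[/\1\]" % "|".join(BLOCKS), re.DOTALL)
# Any well-formed [tag]...[/tag] pair inside a block counts as a child label.
CHILD_SPAN = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def build_hierarchy(annotated):
    """Return a list of {"block", "children", "valid"} dicts, one per block."""
    result = []
    for block in BLOCK_SPAN.finditer(annotated):
        children = [(c.group(1), " ".join(c.group(2).split()))
                    for c in CHILD_SPAN.finditer(block.group(2))]
        result.append({
            "block": block.group(1),
            "children": children,
            # Assumed reading: a passage with only one label is not a block.
            "valid": len(children) >= 2,
        })
    return result
```

Child tags whose end tag is missing are silently skipped, so the output is only as complete as the annotation itself.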
● Contact info
This block typically contains the following kinds of labels: address, position, affiliation, fax, phone, email. A semi-structured passage including a subset of these labels should be annotated as a “contact_info” block.
<b>Office:</b>
[contactinfo][address]D327
Tel: [phone]919-660-6587[/phone]
Fax: [fax]919-660-6519[/fax][/contactinfo]
● Introduction
This block may contain different types of information, such as position, affiliation, educational history, work experience, and so on. Annotate the passages that contain the most useful information (as defined in the previous section) as “introduction”; passages containing only work experience should not be annotated as “introduction”. If a block contains both employment and education information, we annotate it as introduction.
[introduction][position]Assistant
Professor[/position]
[affiliation]Computer Science Department
Home Publications Students Teaching Personal (this should be annotated as others, not as any block)
[introduction]I am an [position]Assistant
Professor[/position] of [affiliation]Computer Science at
I co-direct the [affiliation]Duke Database Research
Group[/affiliation], which is part of the Duke Systems and Architecture Group. We also
participate in the larger Carolina Database Research Group.[/introduction]
● Education
This block should enclose the educational information, such as the labels phdmajor, phduniv, phddate, and so on. A block that contains only this type of information should be annotated as “education”. However, a block that also contains other information, such as employment experience, should be annotated as “introduction” instead of “education”.
EDUCATION
[education]Ph.D.,
[phdmajor]Computer
Science[/phdmajor], [phddate]1979[/phddate], [phduniv]
M.S., [msmajor]Computer
Science[/msmajor], [msdate]1975[/msdate], [msuniv]
B.S., [bsmajor]Mathematics[/bsmajor], [bsdate]1973[/bsdate], [bsuniv]
● Publication
This block contains papers, lectures, talks, patents, and all reading sources related to research.
SELECTED PUBLICATIONS
[publication]R. Akers,
J. Gatheral, Y. Epelbaum,
J. Han, K. Laud, O. Lubovitsky, E. Kant, and C. Randall, "Implementing
Option-Pricing Models Using Software Synthesis," Computing in Science
& Engineering, November/December 1999, pp. 54-64.
C. Randall and E. Kant, and
A. Chhabra, "Using Program Synthesis to Price Derivatives," Journal
of Computational Finance, Vol. 1, No. 2, 1998, pp 97-128.[/publication]
SELECTED KEYNOTE TALKS
[publication]"Program
Synthesis for Mathematical Modeling Applications", Seventh International
Conference on Industrial and Engineering Applications of Artificial
Intelligence and Expert Systems, Austin, Texas, 1994
"Knowledge-Based
Support for Scientific Programming," The Seventh Knowledge-Based Software
Engineering Conference,
"Understanding and
Automating Algorithm Design," IJCAI-85,
● Research interests (do not annotate this)
This block typically contains several sentences introducing the research areas of the researcher. It may be titled “research topics” or “research areas”. If the sentences describe the area, in other words the topic that the researcher is working on, they should be annotated as research interests. However, when the sentences describe research experience, for example that the researcher joined a company in 1998, the block should not be annotated as research interests. Sentences that do not contain the special words indicating “research interests” should not be annotated. The areas the lab works on should not be annotated.
RESEARCH INTERESTS
[resinterests]Automation
of mathematical modeling, aids to scientific problem solving, program
synthesis, automated algorithm design, object-oriented/rule-based programming,
knowledge representation[/resinterests].
The following should not be annotated as research interests. It can be annotated as research activities.
Research Experiences
* Summer 06: Microsoft
Research Database Group. Research Intern.
*
CEDR Event Processing Project
* Summer 05: Microsoft
Research Database Group. Research Intern.
*
Immortal DB Project
<b>Current Research:</b>
[resinterests]Distributed access control
systems, distributed theorem proving in access control logics, security for
mobile and pervasive computing[/resinterests]
● Research activities/Academic activities
Activities include positions in conferences, such as program chair, and research projects (passages introducing the programs that the researcher joined). Other experience or introductions of projects the researcher joined should not be annotated.