Indexes - Analyze
Shows how an analyzer breaks text into tokens.
POST {endpoint}/indexes('{indexName}')/search.analyze?api-version=2024-07-01
URI Parameters
Name | In | Required | Type | Description |
---|---|---|---|---|
endpoint | path | True | string | The endpoint URL of the search service. |
indexName | path | True | string | The name of the index for which to test an analyzer. |
api-version | query | True | string | Client API version. |
Request Header
Name | Required | Type | Description |
---|---|---|---|
x-ms-client-request-id | | string (uuid) | The tracking ID sent with the request to help with debugging. |
Request Body
Name | Required | Type | Description |
---|---|---|---|
text | True | string | The text to break into tokens. |
analyzer | | LexicalAnalyzerName | The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive. |
charFilters | | CharFilterName[] | An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
tokenFilters | | TokenFilterName[] | An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
tokenizer | | LexicalTokenizerName | The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive. |
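The constraints above (exactly one of `analyzer` or `tokenizer`; filter lists only alongside `tokenizer`) can be enforced client-side before sending the request. The following is a minimal sketch; `build_analyze_request` is a hypothetical helper, not part of any Azure SDK.

```python
def build_analyze_request(text, analyzer=None, tokenizer=None,
                          char_filters=None, token_filters=None):
    """Build a search.analyze request body, enforcing the documented rules."""
    if (analyzer is None) == (tokenizer is None):
        # analyzer and tokenizer are mutually exclusive, and one is required.
        raise ValueError("Specify exactly one of 'analyzer' or 'tokenizer'.")
    if (char_filters or token_filters) and tokenizer is None:
        # charFilters/tokenFilters can only be set with the tokenizer parameter.
        raise ValueError("charFilters and tokenFilters require 'tokenizer'.")
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
        if char_filters:
            body["charFilters"] = char_filters
        if token_filters:
            body["tokenFilters"] = token_filters
    return body
```

For example, `build_analyze_request("Text to analyze", analyzer="standard.lucene")` yields the same body as the sample request below, while passing both `analyzer` and `tokenizer` raises `ValueError`.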
Responses
Name | Type | Description |
---|---|---|
200 OK | AnalyzeResult | |
Other Status Codes | ErrorResponse | Error response. |
Examples
SearchServiceIndexAnalyze
Sample request
POST https://myservice.search.windows.net/indexes('hotels')/search.analyze?api-version=2024-07-01
{
  "text": "Text to analyze",
  "analyzer": "standard.lucene"
}
Sample response
{
  "tokens": [
    {
      "token": "text",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    },
    {
      "token": "to",
      "startOffset": 5,
      "endOffset": 7,
      "position": 1
    },
    {
      "token": "analyze",
      "startOffset": 8,
      "endOffset": 15,
      "position": 2
    }
  ]
}
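The sample exchange above can be reproduced with the standard library alone. This is a sketch under stated assumptions: the service name, index, and key are placeholders you must replace, and it assumes key-based authentication via the `api-key` request header (Azure AI Search also supports other authentication modes).

```python
import json
import urllib.request

API_VERSION = "2024-07-01"

def analyze_url(endpoint, index_name, api_version=API_VERSION):
    """Compose the search.analyze URL for the given index."""
    return (f"{endpoint}/indexes('{index_name}')"
            f"/search.analyze?api-version={api_version}")

def analyze(endpoint, index_name, api_key, body):
    """POST the analyze request body and return the decoded AnalyzeResult."""
    req = urllib.request.Request(
        analyze_url(endpoint, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Placeholder values; substitute your own service, index, and admin key:
# result = analyze("https://myservice.search.windows.net", "hotels",
#                  "<admin-key>",
#                  {"text": "Text to analyze", "analyzer": "standard.lucene"})
```

Note the OData-style path segment `indexes('hotels')`, matching the URL template at the top of this page.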
Definitions
Name | Description |
---|---|
AnalyzedTokenInfo | Information about a token returned by an analyzer. |
AnalyzeRequest | Specifies some text and analysis components used to break that text into tokens. |
AnalyzeResult | The result of testing an analyzer on text. |
CharFilterName | Defines the names of all character filters supported by the search engine. |
ErrorAdditionalInfo | The resource management error additional info. |
ErrorDetail | The error detail. |
ErrorResponse | Error response. |
LexicalAnalyzerName | Defines the names of all text analyzers supported by the search engine. |
LexicalTokenizerName | Defines the names of all tokenizers supported by the search engine. |
TokenFilterName | Defines the names of all token filters supported by the search engine. |
AnalyzedTokenInfo
Information about a token returned by an analyzer.
Name | Type | Description |
---|---|---|
endOffset | integer | The index of the last character of the token in the input text. |
position | integer | The position of the token in the input text relative to other tokens. The first token in the input text has position 0, the next has position 1, and so on. Depending on the analyzer used, some tokens might have the same position, for example if they are synonyms of each other. |
startOffset | integer | The index of the first character of the token in the input text. |
token | string | The token returned by the analyzer. |
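The offsets let you map each token back to its span in the input text. In the sample response above they behave as half-open `[startOffset, endOffset)` slice bounds, which the sketch below assumes; note that `standard.lucene` lowercases, so the recovered span may differ from the token in case only.

```python
sample_text = "Text to analyze"
# Tokens copied from the sample response above.
tokens = [
    {"token": "text", "startOffset": 0, "endOffset": 4, "position": 0},
    {"token": "to", "startOffset": 5, "endOffset": 7, "position": 1},
    {"token": "analyze", "startOffset": 8, "endOffset": 15, "position": 2},
]

def recovered_spans(text, analyzed_tokens):
    """Slice the input text with each token's reported offsets."""
    return [text[t["startOffset"]:t["endOffset"]] for t in analyzed_tokens]

spans = recovered_spans(sample_text, tokens)
# spans differ from the tokens only by the lowercasing the analyzer applied.
assert [s.lower() for s in spans] == [t["token"] for t in tokens]
```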
AnalyzeRequest
Specifies some text and analysis components used to break that text into tokens.
Name | Type | Description |
---|---|---|
analyzer | LexicalAnalyzerName | The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive. |
charFilters | CharFilterName[] | An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
text | string | The text to break into tokens. |
tokenFilters | TokenFilterName[] | An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
tokenizer | LexicalTokenizerName | The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive. |
AnalyzeResult
The result of testing an analyzer on text.
Name | Type | Description |
---|---|---|
tokens | AnalyzedTokenInfo[] | The list of tokens returned by the analyzer specified in the request. |
CharFilterName
Defines the names of all character filters supported by the search engine.
Name | Type | Description |
---|---|---|
html_strip | string | A character filter that attempts to strip out HTML constructs. See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html |
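Character filters such as `html_strip` can only appear in an analyze request that uses the `tokenizer` parameter. The sketch below builds such a body; the `standard_v2` tokenizer and `lowercase` token filter names are assumptions drawn from the LexicalTokenizerName and TokenFilterName definitions (not shown in full on this page), not values confirmed by the tables above.

```python
import json

# An analyze request pairing html_strip with a tokenizer. charFilters and
# tokenFilters are invalid when 'analyzer' is used instead of 'tokenizer'.
body = {
    "text": "<p>Ocean-view suite</p>",
    "tokenizer": "standard_v2",      # assumed tokenizer name
    "charFilters": ["html_strip"],
    "tokenFilters": ["lowercase"],   # assumed token filter name
}
payload = json.dumps(body)
```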
ErrorAdditionalInfo
The resource management error additional info.
Name | Type | Description |
---|---|---|
info | object | The additional info. |
type | string | The additional info type. |
ErrorDetail
The error detail.
Name | Type | Description |
---|---|---|
additionalInfo | ErrorAdditionalInfo[] | The error additional info. |
code | string | The error code. |
details | ErrorDetail[] | The error details. |
message | string | The error message. |
target | string | The error target. |
ErrorResponse
Error response
Name | Type | Description |
---|---|---|
error | ErrorDetail | The error object. |
LexicalAnalyzerName
Defines the names of all text analyzers supported by the search engine.
Name | Type | Description |
---|---|---|
ar.lucene | string | Lucene analyzer for Arabic. |
ar.microsoft | string | Microsoft analyzer for Arabic. |
bg.lucene | string | Lucene analyzer for Bulgarian. |
bg.microsoft | string | Microsoft analyzer for Bulgarian. |
bn.microsoft | string | Microsoft analyzer for Bangla. |
ca.lucene | string | Lucene analyzer for Catalan. |
ca.microsoft | string | Microsoft analyzer for Catalan. |
cs.lucene | string | Lucene analyzer for Czech. |
cs.microsoft | string | Microsoft analyzer for Czech. |
da.lucene | string | Lucene analyzer for Danish. |
da.microsoft | string | Microsoft analyzer for Danish. |
de.lucene | string | Lucene analyzer for German. |
de.microsoft | string | Microsoft analyzer for German. |
el.lucene | string | Lucene analyzer for Greek. |
el.microsoft | string | Microsoft analyzer for Greek. |
en.lucene | string | Lucene analyzer for English. |
en.microsoft | string | Microsoft analyzer for English. |
es.lucene | string | Lucene analyzer for Spanish. |
es.microsoft | string | Microsoft analyzer for Spanish. |
et.microsoft | string | Microsoft analyzer for Estonian. |
eu.lucene | string | Lucene analyzer for Basque. |
fa.lucene | string | Lucene analyzer for Persian. |
fi.lucene | string | Lucene analyzer for Finnish. |
fi.microsoft | string | Microsoft analyzer for Finnish. |
fr.lucene | string | Lucene analyzer for French. |
fr.microsoft | string | Microsoft analyzer for French. |
ga.lucene | string | Lucene analyzer for Irish. |
gl.lucene | string | Lucene analyzer for Galician. |
gu.microsoft | string | Microsoft analyzer for Gujarati. |
he.microsoft | string | Microsoft analyzer for Hebrew. |
hi.lucene | string | Lucene analyzer for Hindi. |
hi.microsoft | string | Microsoft analyzer for Hindi. |
hr.microsoft | string | Microsoft analyzer for Croatian. |
hu.lucene | string | Lucene analyzer for Hungarian. |
hu.microsoft | string | Microsoft analyzer for Hungarian. |
hy.lucene | string | Lucene analyzer for Armenian. |
id.lucene | string | Lucene analyzer for Indonesian. |
id.microsoft | string | Microsoft analyzer for Indonesian (Bahasa). |
is.microsoft | string | Microsoft analyzer for Icelandic. |
it.lucene | string | Lucene analyzer for Italian. |
it.microsoft | string | Microsoft analyzer for Italian. |
ja.lucene | string | Lucene analyzer for Japanese. |
ja.microsoft | string | Microsoft analyzer for Japanese. |
keyword | string | Treats the entire content of a field as a single token. This is useful for data like zip codes, ids, and some product names. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html |
kn.microsoft | string | Microsoft analyzer for Kannada. |
ko.lucene | string | Lucene analyzer for Korean. |
ko.microsoft | string | Microsoft analyzer for Korean. |
lt.microsoft | string | Microsoft analyzer for Lithuanian. |
lv.lucene | string | Lucene analyzer for Latvian. |
lv.microsoft | string | Microsoft analyzer for Latvian. |
ml.microsoft | string | Microsoft analyzer for Malayalam. |
mr.microsoft | string | Microsoft analyzer for Marathi. |
ms.microsoft | string | Microsoft analyzer for Malay (Latin). |
nb.microsoft | string | Microsoft analyzer for Norwegian (Bokmål). |
nl.lucene | string | Lucene analyzer for Dutch. |
nl.microsoft | string | Microsoft analyzer for Dutch. |
no.lucene | string | Lucene analyzer for Norwegian. |
pa.microsoft | string | Microsoft analyzer for Punjabi. |
pattern | string | Flexibly separates text into terms via a regular expression pattern. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html |
pl.lucene | string | Lucene analyzer for Polish. |
pl.microsoft | string | Microsoft analyzer for Polish. |
pt-BR.lucene | string | Lucene analyzer for Portuguese (Brazil). |
pt-BR.microsoft | string | Microsoft analyzer for Portuguese (Brazil). |
pt-PT.lucene | string | Lucene analyzer for Portuguese (Portugal). |
pt-PT.microsoft | string | Microsoft analyzer for Portuguese (Portugal). |
ro.lucene | string | Lucene analyzer for Romanian. |
ro.microsoft | string | Microsoft analyzer for Romanian. |
ru.lucene | string | Lucene analyzer for Russian. |
ru.microsoft | string | Microsoft analyzer for Russian. |
simple | string | Divides text at non-letters and converts them to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/SimpleAnalyzer.html |
sk.microsoft | string | Microsoft analyzer for Slovak. |
sl.microsoft | string | Microsoft analyzer for Slovenian. |
sr-cyrillic.microsoft | string | Microsoft analyzer for Serbian (Cyrillic). |
sr-latin.microsoft | string | Microsoft analyzer for Serbian (Latin). |
standard.lucene | string | Standard Lucene analyzer. |
standardasciifolding.lucene | string | Standard ASCII Folding Lucene analyzer. See https://learn.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#Analyzers |
stop | string | Divides text at non-letters; applies the lowercase and stopword token filters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopAnalyzer.html |
sv.lucene | string | Lucene analyzer for Swedish. |
sv.microsoft | string | Microsoft analyzer for Swedish. |
ta.microsoft | string | Microsoft analyzer for Tamil. |
te.microsoft | string | Microsoft analyzer for Telugu. |
th.lucene | string | Lucene analyzer for Thai. |
th.microsoft | string | Microsoft analyzer for Thai. |
tr.lucene | string | Lucene analyzer for Turkish. |
tr.microsoft | string | Microsoft analyzer for Turkish. |
uk.microsoft | string | Microsoft analyzer for Ukrainian. |
ur.microsoft | string | Microsoft analyzer for Urdu. |
vi.microsoft | string | Microsoft analyzer for Vietnamese. |
whitespace | string | An analyzer that uses the whitespace tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html |
zh-Hans.lucene | string | Lucene analyzer for Chinese (Simplified). |
zh-Hans.microsoft | string | Microsoft analyzer for Chinese (Simplified). |
zh-Hant.lucene | string | Lucene analyzer for Chinese (Traditional). |
zh-Hant.microsoft | string | Microsoft analyzer for Chinese (Traditional). |
LexicalTokenizerName
Defines the names of all tokenizers supported by the search engine.
TokenFilterName
Defines the names of all token filters supported by the search engine.