Detect language for each Amazon review
In this section of the lab, we will use the UDF to detect the language of a text column. For that, we will create a new table from our base table with an additional column: language.
The create script for the new table is available in the “Saved queries” section of your workgroup. You can either run it from the saved queries or execute it manually.
Follow whichever set of instructions below applies.
Instructions to execute saved query
Start by clicking the Saved queries tab.
Double-click the query whose name contains Step-2.
Ensure the workgroup is V2EngineWorkGroup, then click Run query.
Instructions to execute manually
Open the Athena query editor, make sure you are in the V2EngineWorkGroup workgroup, and execute the query below.
CREATE TABLE default.amazon_reviews_with_language
WITH (format='parquet') AS
USING EXTERNAL FUNCTION detect_dominant_language(col1 VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *, detect_dominant_language(review_body) AS language
The above step creates the new table, amazon_reviews_with_language. Let us run the query below to list all languages, sorted by their occurrence count.
SELECT language, count(*) AS count FROM default.amazon_reviews_with_language GROUP BY language ORDER BY count DESC;
What just happened
We invoked the textanalytics-udf Lambda function and used its detect_dominant_language method to detect the language of the 2000 values in the review_body column of the main table. The results were then written into a new column named language. The Lambda function invoked Amazon Comprehend APIs to detect the dominant language.
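To make the flow above more concrete, here is a minimal Python sketch of what a function like this might do for each row. It is not the actual code of textanalytics-udf, which the lab does not show. The helper names pick_dominant and add_language_column are hypothetical. The client argument is assumed to behave like a boto3 Comprehend client, whose detect_dominant_language call returns a Languages list of LanguageCode and Score pairs.

```python
from typing import Any, Dict, List


def pick_dominant(languages: List[Dict[str, Any]]) -> str:
    """Return the LanguageCode with the highest Score, or '' if the list is empty."""
    if not languages:
        return ""
    best = max(languages, key=lambda item: item["Score"])
    return best["LanguageCode"]


def detect_dominant_language(client: Any, text: str) -> str:
    """Ask Comprehend for the candidate languages of one string and keep the top one.

    Assumption: `client` exposes detect_dominant_language(Text=...) like a
    boto3 Comprehend client does.
    """
    response = client.detect_dominant_language(Text=text)
    return pick_dominant(response["Languages"])


def add_language_column(client: Any, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Mimic SELECT *, detect_dominant_language(review_body) AS language.

    Each input row is copied and given an extra 'language' key (hypothetical helper).
    """
    return [
        {**row, "language": detect_dominant_language(client, row["review_body"])}
        for row in rows
    ]
```

With AWS credentials configured, boto3.client("comprehend") could be passed as client. The sketch calls the API once per row for readability; a real UDF would more likely batch its requests to reduce the number of API calls.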