Contributed by Tamara Osifchin. She attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on her second class project (due in the 4th week of the program).
I approached my second project — the Shiny project — as more of a fun coding challenge than a data analysis challenge. My objectives were to:
- Gain an understanding of how Shiny applications are structured.
- Use a publicly available API to retrieve the data displayed in the Shiny application.
- Use live data.
- Incorporate a word cloud, because everyone loves a word cloud…
With these objectives in mind, I came up with the idea of an application that builds a word cloud populated with terms from the top 100 trending listings on Etsy. Although the application itself is more fun than useful, the project was very effective as a teaching tool, as it incorporated several Shiny and/or coding concepts in a relatively small amount of code. The final product will be available shortly at NYC Data Science Academy’s website.
As I had only a few days to complete the project, I started with an application from the Shiny Gallery (Shiny Gallery Wordcloud). From there, my development process was as follows:
- Investigate the Etsy API and find a way to generate three corpora of Etsy terms.
- Replace the three corpora of Shakespeare terms in the Shiny Gallery app with the three corpora of Etsy terms.
- Enhance the Shiny application to allow the user to update the data from Etsy in real time.
Investigating the Etsy API
The Etsy API is well documented and can be found at https://www.etsy.com/developers/documentation. The first page of the API Reference lists all of the available classes, as below:
I zeroed in on the Listing class. After reading through the various methods available to retrieve Listing data and the attributes available for each Listing, I decided to use the getTrendingListings method and to retrieve the title, occasion, and ‘taxonomy_path’ for each Listing. ‘Taxonomy’ refers to search taxonomy, as displayed in the screenshot below.
‘Jewelry/Necklaces/Charm Necklaces’ is an example of a taxonomy path. Taxonomy is similar to the idea of a category, but the API documentation stated that taxonomy is replacing a similar “category” attribute, so I went with the taxonomy_path attribute. Retrieving three different attributes would give me three different sets of terms from which to build word clouds. The resulting URL request is coded as:
url_str = ("https://openapi.etsy.com/v2/listings/trending?method=GET&api_key=<my_private_API_key>&fields=title,occasion,taxonomy_path&limit=%d&offset=%d") % (batch_size, n)
I specified the number of listings to return per request (batch_size) and the number of requests to send (n) at the command line during coding, to facilitate debugging. The final version of the code has batch_size set to 25 and n set to 4, so that terms from the top 100 listings are displayed.
The data returned was generally clean and consistent, but the code does massage the raw data in a few ways. Occasionally a ‘private’ listing is returned, which the code ignores, because private listings have no data in them. The code also converts accented e characters (as in ‘décor’, which appears often on Etsy) to a basic ‘e’ to simplify downstream string processing. Finally, white space within the levels of the taxonomy paths is removed, so that levels such as ‘home & decor’ appear as ‘home&decor’ in the final word cloud.
Replace the Three Corpora of Shakespeare Terms with the Etsy Trending Terms
The code reads the three attribute values (title, occasion, taxonomy path) for each listing into a data frame, one attribute per column, and then concatenates the row values of each column into a text vector. Each of the three text vectors is fed into functions from the tm (Text Mining) R library, which the original word cloud gallery application used to create the corpora for the word clouds. I slightly modified the content transformers used by the original Shiny Gallery application; for example, the Gallery application removed all punctuation from each corpus, but the Etsy Trending application retains some punctuation (specifically, the ‘&’) in the taxonomy corpus.
Modify the Shiny Application to Allow the User to Update the Data in Real Time
Modifying the application to retrieve and process live data was the most challenging part of the project. The first challenge was to design a set of nested reactive functions to minimize the number of calls to Etsy.
Minimizing the frequency of requests to Etsy not only improves user response time but also reduces the chances of being refused service by Etsy for sending too many calls per day. My design goals were as follows:
- The content of the corpora should be refreshed from Etsy only when the button is clicked.
- The corpus value (i.e., title corpus, occasion corpus or taxonomy corpus) should be set when the button is clicked or a listing attribute is selected.
- The word cloud should be rendered when the button is clicked, a listing attribute is selected or a slider value is changed.
The following pseudocode describes the final design:
output$plot <- renderPlot({
  # refreshData reacts to the action button only:
  refreshData()
  # selectTerms reacts to the action button and attribute selection:
  v <- selectTerms()
  # wordcloud_rep reacts to the sliders, as well as a change in v:
  wordcloud_rep(names(v), v, scale = c(4, 0.5), min.freq = input$freq,
                max.words = input$max, colors = brewer.pal(8, "Dark2"))
})
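The same layering can be sketched outside Shiny. The Python class below is only an analogy and is not the application's R code: it assumes each layer can be cached on the inputs it depends on, so the expensive fetch runs only when the click count changes. TermPipeline, fetch and the method names are hypothetical, and the word counting is a simple stand-in for the tm corpus processing.

```python
from collections import Counter


class TermPipeline:
    """Three layers, each recomputed only when its own inputs change."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable that contacts the API
        self._data_key = None
        self._data = None
        self._terms_key = None
        self._terms = None

    def refresh_data(self, clicks):
        # Layer 1: depends on the button click count only.
        if self._data_key != clicks:
            self._data = self._fetch()
            self._data_key = clicks
        return self._data

    def select_terms(self, clicks, attribute):
        # Layer 2: depends on the button and the selected attribute.
        key = (clicks, attribute)
        if self._terms_key != key:
            rows = self.refresh_data(clicks)
            words = " ".join(row[attribute] for row in rows).lower().split()
            self._terms = Counter(words)
            self._terms_key = key
        return self._terms

    def render(self, clicks, attribute, min_freq, max_words):
        # Layer 3: cheap, so it reruns on every slider change.
        terms = self.select_terms(clicks, attribute)
        kept = [(w, n) for w, n in terms.most_common() if n >= min_freq]
        return kept[:max_words]
```

Moving a slider (min_freq or max_words) or switching the attribute reuses the cached listings; only a new click count triggers another fetch.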
The second challenge in working with live data was error handling. With live data the code has to be prepared for almost anything. The Etsy Trending application has some error handling, in that it displays a simple message to the user when the Etsy request returns anything other than HTTP status 200, and it skips over the ‘private’ listings. However, I would have liked to do more testing, to see whether there are any other anomalies in the data that the code might need to support. The first ‘private’ listing to appear in my data actually appeared after I had submitted the project, when I was working on the project presentation. More thorough testing and more robust error handling in the original code would have prevented the last-minute scramble to fix the code before the presentation.
One Additional Note
I added the ‘Include jewelry taxonomy’ checkbox at the last minute, because the jewelry term was so much more frequent than every other taxonomy term that it overshadowed the other terms in the cloud. Unchecking ‘Include jewelry taxonomy’ allows the more interesting terms underneath to display more prominently.
In Conclusion
The final product was a hit with my audience (i.e., my classmates) and actually inspired a few of them to build word clouds and tackle APIs on their next projects. The data displayed in the Etsy Trending Word Cloud may not surprise anyone, but that was not the goal. My goals were to develop an understanding of how a Shiny application is structured, to practice my coding skills, and to bring another lovely word cloud into existence. Mission accomplished.