Amazon, as the largest e-commerce corporation in the United States, offers the widest range of products in the world. Their product data can be useful in a variety of ways, and you can easily extract this data with web scraping. This guide will help you develop your approach for extracting product and pricing information from Amazon, and you’ll better understand how to use web scraping tools and tricks to efficiently gather the data you need.
The Benefits of Scraping Amazon
Web scraping Amazon data helps you concentrate on competitor price research, real-time cost monitoring and seasonal shifts in order to provide consumers with better product offers. Web scraping allows you to extract relevant data from the Amazon website and save it in a spreadsheet or JSON format. You can even automate the process to update the data on a regular weekly or monthly basis.
There is currently no way to simply export product data from Amazon to a spreadsheet. This problem is easily solved with web scraping, whether you need the data for competitor research, comparison shopping, creating an API for your app project or any other business need.
Here are some other specific benefits of using a web scraper for Amazon:
- Utilize details from product search results to improve your Amazon SEO status or Amazon marketing campaigns
- Compare and contrast your offering with that of your competitors
- Use review data for review management and product optimization for retailers or manufacturers
- Discover the products that are trending and look up the top-selling product lists for a category
Scraping Amazon is a busy market today, with a large number of companies offering product, price, analysis, and other types of monitoring solutions specifically for Amazon. Attempting to scrape Amazon data on a wide scale, however, is a difficult process that often gets blocked by their anti-scraping technology. It’s no easy task to scrape such a giant site when you’re a beginner, so this step-by-step guide should help you scrape Amazon data, especially when you’re using Python Scrapy and Scraper API.
First, Decide On Your Web Scraping Approach
One method for scraping data from Amazon is to crawl each keyword’s category or shelf list, then request the product page for each one before moving on to the next. This is best for smaller scale, less-repetitive scraping. Another option is to create a database of products you want to track by having a list of products or ASINs (unique product identifiers), then have your Amazon web scraper scrape each of these individual pages every day/week/etc. This is the most common method among scrapers who track products for themselves or as a service.
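As a rough illustration of the second approach, the sketch below turns a tracked list of ASINs into the product page URLs that a scheduled job could request. The function name is made up for this example and the second ASIN is a placeholder; the /dp/<ASIN> URL pattern is the one used later in this guide.

```python
def build_product_urls(asins):
    """Return one Amazon product page URL per ASIN, in the given order."""
    return [f"https://www.amazon.com/dp/{asin.strip()}" for asin in asins]


# The second ASIN is a placeholder, used only to show the shape of the input.
tracked_asins = ["1844076342", "B000000000"]

for url in build_product_urls(tracked_asins):
    print(url)
```

A daily or weekly job would loop over this list and hand each URL to the spider, instead of crawling search results first.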
Scrape Data From Amazon Using Scraper API with Python Scrapy
Scraper API allows you to scrape the most challenging websites like Amazon at scale for a fraction of the cost of using residential proxies. We designed anti-bot bypasses right into the API, and you can access additional features like IP geotargeting (&country_code=us) for over 50 countries, JavaScript rendering (&render=true), JSON parsing (&autoparse=true) and more by simply adding extra parameters to your API requests. Send your requests to our single API endpoint or proxy port, and we’ll provide a successful HTML response.
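To make the "extra parameters" idea concrete, here is a minimal sketch that builds a request URL for the API endpoint with optional flags. The helper name build_api_url is an assumption made for this example; the endpoint and the country_code, render and autoparse parameters are the ones mentioned in this guide.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.scraperapi.com/"


def build_api_url(api_key, target_url, **options):
    """Build an API request URL; extra keyword arguments become query parameters."""
    payload = {"api_key": api_key, "url": target_url}
    payload.update(options)
    # urlencode percent-encodes the target URL so it survives as a single parameter
    return API_ENDPOINT + "?" + urlencode(payload)


print(build_api_url(
    "<YOUR_API_KEY>",
    "https://www.amazon.com/dp/1844076342",
    country_code="us",
    render="true",
))
```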
Start Scraping with Scrapy
Scrapy is a web crawling and data extraction framework that can be used for a variety of applications such as data mining, information retrieval and historical archiving. Since Scrapy is written in the Python programming language, you’ll need to install Python before you can use pip (Python’s package installer).
To install Scrapy using pip, run:
```bash
pip install scrapy
```
Then go to the folder where you want to keep your project and run the “startproject” command along with the project name, “amazon_scraper”. Scrapy will construct a web scraping project folder for you, with everything already set up:
```bash
scrapy startproject amazon_scraper
```
The result should look like this (the listing below comes from a project named “tutorial”; your module folder will carry the project name you chose, and amazon.py appears once you create the spider in the next step):
```
├── scrapy.cfg                # deploy configuration file
└── tutorial                  # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── amazon.py         # the spider file
```
Scrapy creates all of the files you’ll need, and each file serves a particular purpose:
- items.py – Can be used to build your base dictionary, which you can then import into the spider.
- settings.py – All of your request settings, pipeline, and middleware activation happens in settings.py. You can adjust the delays, concurrency, and several other parameters here.
- pipelines.py – The item yielded by the spider is passed to pipelines.py, which is mainly used to clean the text and write to data stores (Excel, SQL, etc).
- middlewares.py – When you want to change how the request is made and how Scrapy handles the response, middlewares.py comes in handy.
Create an Amazon Spider
You’ve established the project’s overall structure, so now you’re ready to start working on the spiders that will do the scraping. Scrapy has a variety of spider types, but we’ll focus on the most popular one, the generic Spider, in this tutorial.
Simply run the “genspider” command to make a new spider:
```bash
# syntax is --> scrapy genspider name_of_spider website.com
scrapy genspider amazon amazon.com
```
Scrapy now creates a new file called “amazon.py” in the spiders folder, containing a spider template. Your code should look like the following:
```python
import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass
```
Delete the default code (allowed_domains, start_urls, and the parse function) and replace it with your own, which should include these four functions:
- start_requests — sends an Amazon search query with a specific keyword.
- parse_keyword_response — extracts the ASIN value for each product returned in an Amazon keyword query, then sends a new request to Amazon for the product listing. It will also go to the next page and do the same thing.
- parse_product_page — extracts all of the desired data from the product page.
- get_url — builds the Scraper API URL that each request is sent through, so that the API returns the HTML response.
Send a Search Query to Amazon
With an Amazon spider and Scraper API as the proxy solution, you can now scrape Amazon for a particular keyword using the following steps. The spider will parse every page returned by the keyword query, extract each product’s ASIN and scrape all of the key details from the product page. Try using these fields for the spider to scrape from the Amazon product page:
- ASIN
- Product name
- Price
- Product description
- Image URL
- Available sizes and colors
- Customer ratings
- Number of reviews
- Seller ranking
The first step is to create start_requests, a function that sends Amazon search requests containing our keywords. Outside of AmazonSpider, define a list variable that holds our search keywords. Input the keywords you want to search for in Amazon into your script:
```python
queries = ['tshirt for men', 'tshirt for women']
```
Inside the AmazonSpider, you can build your start_requests function, which will submit the requests to Amazon. To access Amazon’s search feature via a URL, submit the search query as the “k=SEARCH KEYWORD” parameter:
https://www.amazon.com/s?k=<SEARCH KEYWORD>
It looks like this when we use it in the start_requests function:
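```python
## amazon.py
from urllib.parse import urlencode

import scrapy

queries = ['tshirt for men', 'tshirt for women']


class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        for query in queries:
            # urlencode makes the keyword safe to use in a query string
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)
```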
You will urlencode each query in your queries list so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL.
Use yield instead of return since Scrapy is asynchronous, so the functions can yield either a request or a completed dictionary. If a new request is yielded, Scrapy schedules it and invokes its callback method once the response arrives. If an item is yielded, it will be sent to the data cleaning pipeline. The parse_keyword_response callback function will then extract the ASIN for each product when the response to the scrapy.Request comes back.
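To see what the encoding step does on its own, the short sketch below builds the search URL outside of Scrapy. The helper name search_url is hypothetical; urlencode comes from Python's standard library and replaces spaces and reserved characters so the keyword is safe inside a query string.

```python
from urllib.parse import urlencode


def search_url(keyword):
    """Build an Amazon search URL for a keyword, encoding it for use in a query string."""
    return "https://www.amazon.com/s?" + urlencode({"k": keyword})


print(search_url("tshirt for men"))
# https://www.amazon.com/s?k=tshirt+for+men
```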
How to Scrape Amazon Products
One of the most popular methods to scrape Amazon includes extracting data from a product listing page. Using an Amazon product page ASIN ID is the simplest and most common way to retrieve this data. Every product on Amazon has an ASIN, which is a unique identifier. We may use this ID in our URLs to get the product page for any Amazon product, such as the following:
https://www.amazon.com/dp/<ASIN>
Using Scrapy’s built-in XPath selector methods, we can extract the ASIN value from the product listing page. You can build an XPath selector in Scrapy Shell that captures the ASIN value for each product on the product listing page and generates a URL for each product:
```python
products = response.xpath('//*[@data-asin]')

for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
```
The function will then be configured to send a request to this URL and call the parse_product_page callback function when it receives a response. This request will also include the meta parameter, which is used to pass items between functions or edit certain settings.
```python
def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')

    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
```
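One practical wrinkle: a selector as broad as //*[@data-asin] matches every element that carries the attribute, which may include elements with an empty value or the same ASIN more than once. A minimal sketch of a cleanup step, assuming you first collect the raw attribute values into a list (the helper name clean_asins is made up for this example):

```python
def clean_asins(raw_values):
    """Drop empty values and duplicates from extracted data-asin values, keeping order."""
    seen = set()
    cleaned = []
    for value in raw_values:
        asin = (value or "").strip()
        if asin and asin not in seen:
            seen.add(asin)
            cleaned.append(asin)
    return cleaned


# Inside the spider this could be fed with:
#   response.xpath('//*[@data-asin]/@data-asin').extract()
print(clean_asins(["1844076342", "", None, "1844076342"]))
```

Filtering before yielding requests avoids spending API credits on empty or repeated product URLs.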
Extract Product Data From the Amazon Product Page
After the parse_keyword_response function requests the product page URL, Scrapy passes the response it receives from Amazon, along with the ASIN in the meta parameter, to the parse_product_page callback function. We now want to extract the information we need from a product page, such as a product page for a t-shirt.
You need to create XPath selectors to extract each field from the HTML response we get from Amazon:
```python
def parse_product_page(self, response):
    # requires `import re` at the top of amazon.py
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
```
Try using a regex selector over an XPath selector for scraping the image url if the XPath is extracting the image in base64.
When working with large websites like Amazon that have a variety of product pages, you’ll find that writing a single XPath selector isn’t always enough since it will work on certain pages but not others. To deal with the different page layouts, you’ll need to write several XPath selectors in situations like these.
When you run into this issue, give the spider three different XPath options:
```python
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

    # try the alternative price locations in turn
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()
```
If the spider is unable to locate a price using the first XPath selector, it goes on to the next. If we look at the product page again, we can see that there are different sizes and colors of the product.
To get this info, we’ll write a quick check to see if this section is on the page, and if it is, we’ll use regex selectors to extract it.
```python
# requires `import re` and `import json` at the top of amazon.py
temp = response.xpath('//*[@id="twister"]')
sizes = []
colors = []
if temp:
    s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
    json_acceptable = s.replace("'", "\"")
    di = json.loads(json_acceptable)
    sizes = di.get('size_name', [])
    colors = di.get('color_name', [])
```
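The snippet above raises an exception if the regex finds nothing or the captured text is not valid JSON. The following is a more defensive sketch of the same idea as a standalone function; the function name and the sample string are assumptions made for illustration, and the "variationValues" key is the one used in this guide.

```python
import json
import re

# Same idea as the inline regex above, with the braces escaped and flexible spacing
VARIATION_PATTERN = re.compile(r'"variationValues"\s*:\s*(\{.*\})')


def extract_variations(html):
    """Return (sizes, colors) found in the page source, or two empty lists."""
    match = VARIATION_PATTERN.search(html)
    if not match:
        return [], []
    try:
        data = json.loads(match.group(1).replace("'", '"'))
    except json.JSONDecodeError:
        return [], []
    return data.get("size_name", []), data.get("color_name", [])


# Made-up page fragment, only to show the expected shape
sample = '"variationValues" : {"size_name":["Small","Large"],"color_name":["Black"]},'
print(extract_variations(sample))
# (['Small', 'Large'], ['Black'])
```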
When all of the pieces are in place, the parse_product_page function will yield a dictionary item, which will be sent to the pipelines.py file for data cleaning:
```python
def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"', response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

    temp = response.xpath('//*[@id="twister"]')
    sizes = []
    colors = []
    if temp:
        s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
        json_acceptable = s.replace("'", "\"")
        di = json.loads(json_acceptable)
        sizes = di.get('size_name', [])
        colors = di.get('color_name', [])

    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

    yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating,
           'NumberOfReviews': number_of_reviews, 'Price': price,
           'AvailableSizes': sizes, 'AvailableColors': colors,
           'BulletPoints': bullet_points, 'SellerRank': seller_rank}
```
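The price lookup in parse_product_page tries several XPath selectors in turn. If more fields need the same treatment, that pattern could be pulled into a small helper. This is only a sketch: first_match and PRICE_XPATHS are names invented here, and the helper assumes a response object that exposes .xpath(...).extract_first() the way Scrapy responses do.

```python
def first_match(response, xpaths):
    """Try each XPath in order and return the first non-empty, stripped value."""
    for xpath in xpaths:
        value = response.xpath(xpath).extract_first()
        if value and value.strip():
            return value.strip()
    return None


# The three price selectors used in this guide, in the same order
PRICE_XPATHS = [
    '//*[@id="priceblock_ourprice"]/text()',
    '//*[@data-asin-price]/@data-asin-price',
    '//*[@id="price_inside_buybox"]/text()',
]

# Inside parse_product_page this would replace the if/or chain:
#   price = first_match(response, PRICE_XPATHS)
```

Keeping the selectors in a list makes it easy to add another layout variant without touching the parsing logic.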
How To Scrape Every Amazon Product on Amazon Product Pages
Our spider can now search Amazon using the keyword we provide and scrape the product information it returns on the website. What if, on the other hand, we want our spider to go through each page and scrape the items on each one?
To accomplish this, we simply need to add a few lines of code to our parse_keyword_response function:
```python
def parse_keyword_response(self, response):
    # requires `from urllib.parse import urljoin` at the top of amazon.py
    products = response.xpath('//*[@data-asin]')

    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

    next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
    if next_page:
        url = urljoin("https://www.amazon.com", next_page)
        # follow the next results page with the same callback
        yield scrapy.Request(url=url, callback=self.parse_keyword_response)
```
After scraping all of the product pages on the first page, the spider looks to see if there is a next page button. If there is, the relative link is retrieved and joined with the Amazon domain to generate a new URL for the next page.
It will then use the callback to run the parse_keyword_response function again and extract the ASIN IDs for each product as well as all of the product data as before.
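The URL-joining step can be tried in isolation. In the sketch below, next_page_url is a hypothetical helper and the href value is invented to stand in for whatever the next page link contains; urljoin is from Python's standard library.

```python
from urllib.parse import urljoin


def next_page_url(href, base="https://www.amazon.com"):
    """Turn the relative link from the next page button into an absolute URL."""
    return urljoin(base, href) if href else None


# Made-up relative link, only for illustration
print(next_page_url("/s?k=tshirt+for+men&page=2"))
# https://www.amazon.com/s?k=tshirt+for+men&page=2
```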
Test Your Spider
Once you’ve developed your spider, you can now test it with the built-in Scrapy CSV exporter:
```bash
scrapy crawl amazon -o test.csv
```
You may notice that there are two issues:
- The text is messy and some values appear as lists.
- You’re receiving 429 responses from Amazon, which means Amazon has detected that your requests are coming from a bot and is blocking the spider.
If Amazon detects a bot, it’s likely that Amazon will ban your IP address and you won’t have the ability to scrape Amazon. In order to solve this issue, you need a large proxy pool and you also need to rotate the proxies and headers for every request. Luckily, Scraper API can help eliminate this hassle.
Connect Your Proxies with Scraper API to Scrape Amazon
Scraper API is a proxy API designed to make web scraping proxies easier to use. Instead of discovering and creating your own proxy infrastructure to rotate proxies and headers for each request, or detecting bans and bypassing anti-bots, you can simply send the URL you want to scrape to the Scraper API. Scraper API will take care of all of your proxy needs and ensure that your spider works in order to successfully scrape Amazon.
Scraper API must be integrated with your spider, and there are three ways to do so:
- Via a single API endpoint
- Scraper API Python SDK
- Scraper API proxy port
If you integrate the API by configuring your spider to send all of your requests to their API endpoint, you just need to build a simple function that sends a GET request to Scraper API with the URL we want to scrape.
First sign up for Scraper API to receive a free API key that allows you to scrape 1,000 pages per month. Fill in the API_KEY variable with your API key:
```python
from urllib.parse import urlencode

API_KEY = '<YOUR_API_KEY>'


def get_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
Then we can change our spider functions to use the Scraper API proxy by setting the url parameter in scrapy.Request to get_url(url):
```python
def start_requests(self):
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

def parse_keyword_response(self, response):
    ...
    yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)
```
Simply add an extra parameter to the payload to enable geotargeting, JS rendering, residential proxies, and other features. We’ll use the Scraper API’s geotargeting function to make Amazon think our requests are coming from the US, because Amazon adjusts the price data and supplier data displayed depending on the country you’re making the request from. To accomplish this, we must add the flag "&country_code=us" to the request, which can be done by adding another parameter to the payload variable.
Requests for geotargeting from the United States would look like the following:
```python
def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
Then, based on the concurrency limit of our Scraper API plan, we need to adjust the number of concurrent requests we’re authorized to make in the settings.py file. Concurrency refers to the number of requests you may make in parallel at any given time. The more concurrent requests you can make, the quicker you can scrape.
In this tutorial, the spider’s maximum concurrency is set to 5 concurrent requests, as this is the maximum concurrency permitted on Scraper API’s free plan. If your plan allows you to scrape with higher concurrency, then be sure to increase the maximum concurrency in settings.py.
Set RETRY_TIMES to 5 to tell Scrapy to retry any failed requests, and make sure DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled because they reduce concurrency and aren’t required with the Scraper API.
```python
## settings.py
CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
```
Don’t Forget to Clean Up Your Data With Pipelines
As a final step, clean up the data using the pipelines.py file, since the text is messy and some of the values appear as lists.
```python
class TutorialPipeline:

    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item
```
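If you plan to sort or filter on the scraped values, it may also help to convert some of the cleaned strings into numbers. The sketch below is one possible approach and not part of the pipeline above: the function names are invented, and the review count format ("1,234 ratings") is an assumption about what the page returns.

```python
import re


def parse_rating(text):
    """Extract the leading number from a rating string such as '4.5 out of 5 stars'."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group(0)) if match else None


def parse_review_count(text):
    """Extract an integer from a review count string (assumed to look like '1,234 ratings')."""
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


print(parse_rating("4.5 out of 5 stars"))   # 4.5
print(parse_review_count("1,234 ratings"))  # 1234
```

Both helpers return None for empty input, so they can be applied after the pipeline has replaced missing values with empty strings.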
The item is passed to the pipeline for cleaning after the spider has yielded it. We need to add the pipeline to the settings.py file to make it work (if your project is not named “tutorial”, use your own project’s module and pipeline class names in the path):
```python
## settings.py
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}
```
Now you’re good to go and you can use the following command to run the spider and save the result to a csv file:
```bash
scrapy crawl amazon -o test.csv
```
How to Scrape Other Popular Amazon Pages
You can modify the language, response encoding and other aspects of the data returned by Amazon by adding extra parameters to these urls. Remember to always ensure that these urls are safely encoded. We already went over the ways to scrape an Amazon product page, but you can also try scraping the search and sellers pages by adding the following modifications to your script.
Search Page
- To get the search results, simply enter a keyword into the url and safely encode it.
- Format:
https://www.amazon.com/s?k=<SEARCH KEYWORD>
- You may add extra parameters to the search to filter the results by price, brand and other factors.
Sellers Page
- Instead of a dedicated page showing what other sellers offer a product, Amazon recently updated these pages so that now a component slides in. You must now submit a request to the AJAX endpoint that populates this slide-in in order to scrape this data.
- You can refine these findings by using additional parameters such as the item’s state, etc.
- Example:
https://www.amazon.com/gp/aod/ajax/ref=tmm_pap_new_aod_0?filters={"all":true,"new":true}&condition=new&asin=1844076342&pc=dp
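Because the sellers endpoint takes a JSON value in its filters parameter, safe encoding matters here. Below is a minimal sketch that rebuilds the example URL above with proper encoding; the helper name offers_url is made up, and the endpoint path and parameter names are copied from that example rather than from any official documentation.

```python
import json
from urllib.parse import urlencode

# Endpoint path taken from the example URL in this guide
AOD_ENDPOINT = "https://www.amazon.com/gp/aod/ajax/ref=tmm_pap_new_aod_0"


def offers_url(asin, condition="new"):
    """Build the sellers slide-in URL for an ASIN with a safely encoded filters value."""
    filters = json.dumps({"all": True, "new": True}, separators=(",", ":"))
    params = {"filters": filters, "condition": condition, "asin": asin, "pc": "dp"}
    return AOD_ENDPOINT + "?" + urlencode(params)


print(offers_url("1844076342"))
```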
Forget Headless Browsers and Use the Right Amazon Proxy
99.9% of the time you don’t need to use a headless browser. You can scrape Amazon more quickly, cheaply and reliably if you use standard HTTP requests rather than a headless browser in most cases. If you opt for this, don’t enable JS rendering when using the API.
Residential Proxies Aren’t Essential
Scraping Amazon at scale can be done without having to resort to residential proxies, so long as you use high quality datacenter IPs and carefully manage the proxy and user agent rotation.
Don’t Forget About Geotargeting
Geotargeting is a must when you’re scraping a site like Amazon. When scraping Amazon, make sure your requests are geotargeted correctly, or Amazon can return incorrect information.
Previously, you could rely on cookies to geotarget your requests; however, Amazon has improved its detection and blocking of these types of requests. As a result, proxies located in that country must be used to geotarget a particular country. To do this with the Scraper API, for example, set country_code=us.
If you want to see results that Amazon would show to a person in the U.S., you’ll need a US proxy, and if you want to see results that Amazon would show to a person in Germany, you’ll need a German proxy. You must use proxies located in that region if you want to accurately geotarget a specific state, city or postcode.
Scraping Amazon doesn’t have to be difficult with this guide, no matter your coding abilities, scraping needs and budget. You will be able to obtain complete data and make good use of it thanks to the numerous scraping tools and tips available.