
Side, Side Projects – Web Scraping For Deals

Screenshot of App
While learning WordPress development, I came across the idea of Web scraping. I first learned about it from Adnan Kukic’s article Scraping the Web With Node.js on scotch.io. It got me interested in the possibilities: I, for one, would love to use it as a deals watcher across multiple sites (I’m always looking for the opportunity to grab a new case for my iPhone and review it on Excessorize Me).

History

Adnan’s article sparked my interest and I tried following it myself, but it was a bit too advanced for me as a starting point for learning Web scraping. I couldn’t grasp everything he was describing, so I left the idea on the back burner.

That changed today when I read an article by Artiom Dashinsky, who analyzed Designer News to better understand the community (what a UX thing to do). He used import.io to build crawlers that comb through the Designer News site. import.io promises to “turn any website into a table of data or a structured API.” That sounds very neat and surprisingly useful.

import.io has its own YouTube channel with tutorials on how to use the platform, so let’s give it a shot.

GATHERING DATA

I’m not going to go into detail about how import.io works, because they’ve got some great videos and the application guides you through the process thoroughly. Using import.io, I gathered the following elements from The Source’s product pages that I thought would be useful:

The Source Product Page

  • A – Web ID
  • B – Product name
  • C – Original price
  • D – Savings
  • E – Final price
  • F – Icon image for if it’s available online
  • G – Icon image for if it’s available in-store
  • H – Image link

They have different layouts for different types of sales. I found the following 5, which differ in how the prices are listed:

Original

The Source Product Page - Original

Sale Item

The Source Product Page - Sale

Refurbished

The Source Product Page - Refurbished

Special Sale Items

The Source Product Page - Special Sale The Source Product Page - Special Sale

So I trained import.io on the different layouts (except the last one, the Canon EOS Rebel T5 example, because there was no way to differentiate between a refurbished product and a sale product listed in that format).

It takes a while to gather the data, and the crawl collected about 2700 listings at the time, nowhere near the full list of products. I exported the result as a JSON file, but I wasn’t sure how the JSON was structured. The file was so large that it wouldn’t even open for me, yet I needed to know how it was organized in order to read the right values from it. I initially tried logging the output to the console in Chrome DevTools, but at the time I didn’t know enough to make sense of what it was showing me.
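One way to see how a large export is organized without opening it in an editor is to print only its shape. The following is a minimal sketch, not something import.io or jQuery provides: `describeShape` is a hypothetical helper name, and the sample object is made up to mirror the field names used later in this post. It walks a parsed JSON value and reports key names and types, sampling only the first element of each array.

```javascript
// Hypothetical helper: summarize the structure of a parsed JSON value.
// Arrays are reduced to their length plus the shape of their first element.
function describeShape(value, depth) {
  if (depth === undefined) { depth = 3; }
  if (value === null) { return "null"; }
  if (Array.isArray(value)) {
    if (value.length === 0 || depth === 0) { return "array(" + value.length + ")"; }
    return { length: value.length, first: describeShape(value[0], depth - 1) };
  }
  if (typeof value === "object") {
    if (depth === 0) { return "object"; }
    var shape = {};
    Object.keys(value).forEach(function (key) {
      shape[key] = describeShape(value[key], depth - 1);
    });
    return shape;
  }
  return typeof value;
}

// Made-up sample data; the real export may be organized differently.
var sample = {
  data: [
    { title: "Example case", price_orig: 29.99, price_savings: 10, price_final: 19.99 },
    { title: "Another case", price_orig: 0, price_final: 14.99 }
  ]
};

console.log(JSON.stringify(describeShape(sample), null, 2));
```

In Node.js, the real file could be loaded with `JSON.parse(require("fs").readFileSync("the_source_deals_20140601.json", "utf8"))` and passed to the same function.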

So after a few frustrating hours of failing to read the right values from each listing, I decided to recrawl the site for a much smaller set of data. I ended up with about 20 listings, which let me open the file and view its structure, and finally output the correct values. I used jQuery’s getJSON to load the data and then generated my HTML output:

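//DECLARED UP FRONT SO THE CALLBACK BELOW CAN ADD TO THEM
var total = 0;
var the_source_listings = [];

$.getJSON("the_source_deals_20140601.json", function(data){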
  $.each(data.data, function(key, val){
    //TALLY TOTAL NUMBER OF LISTINGS, PROBABLY A BETTER WAY TO DO THIS...
    total += 1;

    //CALCULATE THE DISCOUNT
    var percentage_off = (val.price_savings/val.price_orig)*100;
    var percentage_class;

    //ASSIGN A PERCENTAGE CLASS
    if(percentage_off < 26){ percentage_class = "percentage_25"; }
    else if(percentage_off < 51){ percentage_class = "percentage_50"; }
    else if(percentage_off < 76){ percentage_class = "percentage_75"; }
    else { percentage_class = "percentage_100"; }

    //SALE PRODUCT
    if(!isNaN(percentage_off) && val.price_orig > 0){
      the_source_listings.push("<tr class='the_source_sale_item " + percentage_class + "'><td><img src='" + val.product_img + "' alt='" + val.title + "' height='50px' width='50px'></td><td><a href='" + val._pageUrl + "' target='_blank'>" + val.title + "</a></td><td>$" + val.price_orig + "</td><td>$" + val.price_savings + "</td><td>" + Number(percentage_off).toFixed(2) + "%</td><td>$" + val.price_final + "</td></tr>");
    }
    //REFURBISHED PRODUCT (NO SAVINGS LISTED AND NO ORIGINAL PRICE)
    else if(val.price_savings == undefined && val.price_orig == 0) {
      the_source_listings.push("<tr class='the_source_refurbished_item '><td><img src='" + val.product_img + "' alt='" + val.title + "' height='50px' width='50px'></td><td><a href='" + val._pageUrl + "' target='_blank'>" + val.title + "</a></td><td>Refurbished</td><td>Refurbished</td><td>Refurbished</td><td>$" + val.price_final + "</td></tr>");
    }
  });

  //JOIN THE ROWS INTO ONE HTML STRING BEFORE INSERTING
  $("#the_source_table").html(the_source_listings.join(""));
  $(".total").html(total);
});
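The bucket logic above can also be pulled out into a small pure function, which makes the thresholds easy to check without loading any data. This is a minimal sketch: `discountClass` is a hypothetical name, and because I can't tell from the export alone whether prices arrive as numbers or strings, it converts both with `Number()` and returns null when no discount can be computed.

```javascript
// Hypothetical helper: map an original price and a savings amount to the
// CSS class names used in this post. Returns null when the original price
// is missing or zero, or when the savings value is not numeric.
function discountClass(priceOrig, priceSavings) {
  var orig = Number(priceOrig);
  var savings = Number(priceSavings);
  if (!(orig > 0) || isNaN(savings)) { return null; }

  var percentageOff = (savings / orig) * 100;
  if (percentageOff < 26) { return "percentage_25"; }
  if (percentageOff < 51) { return "percentage_50"; }
  if (percentageOff < 76) { return "percentage_75"; }
  return "percentage_100";
}

console.log(discountClass(100, 20));        // percentage_25
console.log(discountClass("49.99", "25"));  // percentage_50
console.log(discountClass(0, undefined));   // null
```

A null result corresponds to the listings that fall through to the refurbished branch (or are skipped) in the code above.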

BUILDING A UI

I took the opportunity to learn more about Bootstrap, so I implemented that as my template. Using the fluid grid system, I assigned 2 columns for the stacked filtering buttons and 10 for the table of data. Simple and to the point, for now:

<div class="container-fluid">
  <div class="row">
    <nav id="filter" class="col-md-2">
      <ul class="nav nav-pills nav-stacked">
        <li id="nav_all"><a href="#">ALL <span class="total"></span> DEALZ</a></li>
        <li id="nav_25" class="active"><a href="#">0% - 25%</a></li>
        <li id="nav_50" class="active"><a href="#">26% - 50%</a></li>
        <li id="nav_75" class="active"><a href="#">51% - 75%</a></li>
        <li id="nav_100" class="active"><a href="#">76% - 100%</a></li>
        <li id="nav_refurbished" class="active"><a href="#">REFURBZ</a></li>
      </ul>
    </nav>
    <div class="col-md-10">
      <table class="table">
        <thead>
          <tr>
            <th id="col_image">Image</th>
            <th id="col_product">Product</th>
            <th id="col_price_orig">Original Price</th>
            <th id="col_price_savings">Savings</th>
            <th id="col_percentage_off">Percentage</th>
            <th id="col_price_final">Final Price</th>
          </tr>
        </thead>
        <tbody id="the_source_table">

        </tbody>
      </table>
    </div>
  </div>
</div>

I didn’t try to style the page much. Most of the styling comes from the Bootstrap theme, and it wasn’t a priority at such an early stage of development.

CATEGORIZE

I wanted to be able to filter by discount, and I colour coded the rows to give precedence to the higher discounts. Using jQuery, I did some simple toggling so the buttons show when they are active. I also used jQuery to show and hide the rows toggled by the filter buttons, but with a larger dataset I started to see significant lag. I’ll have to figure out another way to do the filtering. An example of how each button works:

$("#nav_50").click(function(event){
  event.preventDefault();
  //SHOW OR HIDE EVERY ROW IN THIS DISCOUNT BUCKET
  $(".percentage_50").toggle("slow");
  //FLIP THE BUTTON'S ACTIVE STATE
  $(this).toggleClass("active");
});
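The lag is probably because every matching row is animated individually, though I haven't measured it. One possible alternative is to keep the filter state in one place and produce a single CSS rule for the hidden buckets, so the browser hides the rows in one pass. The following is only a sketch of that idea: `createFilter` and its methods are hypothetical names, and the step that writes the rule into the page is left as a comment.

```javascript
// Hypothetical sketch: track which buckets are hidden and turn that state
// into one CSS rule, rather than animating each table row with jQuery.
function createFilter(classNames) {
  var hidden = {};
  return {
    // Flip one bucket between shown and hidden; returns true if now hidden.
    toggle: function (className) {
      if (classNames.indexOf(className) === -1) {
        throw new Error("Unknown filter class: " + className);
      }
      hidden[className] = !hidden[className];
      return hidden[className];
    },
    // Build a single rule covering every hidden bucket ("" if none are hidden).
    toCss: function () {
      var selectors = classNames
        .filter(function (name) { return hidden[name]; })
        .map(function (name) { return "." + name; });
      return selectors.length ? selectors.join(", ") + " { display: none; }" : "";
    }
  };
}

var filter = createFilter([
  "percentage_25", "percentage_50", "percentage_75",
  "percentage_100", "the_source_refurbished_item"
]);
filter.toggle("percentage_50");
filter.toggle("percentage_75");
console.log(filter.toCss());
// In the browser, the resulting string could be assigned to the textContent
// of a dedicated <style> element each time a button is clicked.
```

This drops the "slow" animation, which is a trade-off: the rows disappear instantly, but nothing has to be animated per row.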

FUTURE DEVELOPMENT

In future iterations, I’m hoping for a few things:

  • A nav bar on top of the table for different retailers, like a spreadsheet (could also be implemented in the side bar)
  • A search function for particular products
  • UI improvements
  • Automate the data generation (already in the works as I’ve discussed with a friend on using Node.js as a back-end solution)
  • Update filtering process (also discussed to possibly use Angular.js)

My current file directory for the project is very simple.

You can check out the current version here: http://vtse.ca/portfolio/web-design/scrape-me//
