Hi There!

I'm Dan Schlegel, an Associate Professor in the Computer Science Department at SUNY Oswego

Assignment 5 – Structured Summaries

Microproject

Write a Python program which takes a single argument — a URL. Your program will use the Unix command curl to download the file at that URL, remove all of the embedded javascript and css from the file, and write the resulting file to the screen. Be sure to test it on a variety of URLs.

Main Project

Write a Python program that does the following:

  • Asks the user to enter a URL;
  • Downloads the contents of that URL using curl (or some other unix application);
  • Extracts features from the page, including at least:
    • Headings
    • Link text along with URLs
    • Image URL and alt text
    • Email addresses
    • Phone numbers
  • Uses unix tools to determine:
    • word count
    • top N words, sorted alphabetically
  • Builds a structured summary including the above items and show it to the user;
  • Sends the summary to an LLM using a prompt of your choice to analyze the structured summary. Some ideas:
    • Ask the LLM to try to determine what the website is about.
    • Modify the structured summary so that it is segmented by heading section and ask the LLM to verify that the contents of each section of the page matches the heading.
    • Ask the LLM to try to predict any bias on the page.

For the LLM component, I recommend using the Ollama Python library. You can run ollama on your own machine with some small-ish model if your hardware supports it, or get a free Ollama Cloud account and use an API key to access some models. The free tier should be far more than enough to do this project, even with lots of testing.

Demo

Here’s a demo of my version of the project’s output.

Enter a URL: http://pizzatoday.com

--- Summary ---
Words: 8161

Headings:
Pizza Today
Featured
Latest Posts
Latest Recipes
Latest Podcasts
General
Topics
About Us
Contact Us

Links:
Emerald Media Network | https://emeraldx.com
Advertise | https://pizzatoday.com/pizza-today-media-kit-download/
International Pizza Expo | https://pizzaexpo.pizzatoday.com/
Pizza Expo Columbus | https://pizzaexpocolumbus.pizzatoday.com/
Latest | https://pizzatoday.com/latest/
View All Posts » | https://pizzatoday.com/latest/
News | https://pizzatoday.com/news/
Press Releases | https://pizzatoday.com/press-releases/
Podcasts | https://pizzatoday.com/podcasts/
Recipes | https://pizzatoday.com/recipes/
Resources | https://pizzatoday.com/resources/
[... truncated for space ...]

Images:
| https://pizzatoday.com/wp-content/uploads/2021/12/Pizza_Today_Logo.svg
| https://pizzatoday.com/wp-content/uploads/2021/12/Pizza_Today_Logo.svg
pizzeria women of influence | https://pizzatoday.com/wp-content/uploads/2026/03/April_WebImgs_-1-1.png
Image of Mirko D'Agata, Pizza Maker of the Year 2026. | https://pizzatoday.com/wp-content/uploads/2026/03/Mirko-winner-150x150.jpg
image of World Pizza Games area at Pizza Expo 2026 | https://pizzatoday.com/wp-content/uploads/2026/03/World-Pizza-Games-flag-150x150.jpg
2026 Pizza Industry Trends Report | https://pizzatoday.com/wp-content/uploads/2025/12/Dec_WebImgs_-13-150x150.jpeg
Pizza Today Pizza Styles Guide Featured Image | https://pizzatoday.com/wp-content/uploads/2025/06/Style_Promo_900x600-150x150.png
pizzas at a pizza festival | https://pizzatoday.com/wp-content/uploads/2026/04/AdobeStock_646485542.jpeg
Image of a vegan ricotta and squash pizza. | https://pizzatoday.com/wp-content/uploads/2026/04/AdobeStock_249637819-resize.jpg
[... truncated for space ...]

Emails:

Phone Numbers:

Top words:
67 list
50 pizza
24 news
14 with
13 screen
12 only
10 dough
10 april
9 about
8 view

--- LLM Summary ---
This website appears to be an industry-focused publication about pizza, featuring news, recipes, podcasts, and resources for pizza makers and pizzeria businesses. It also promotes industry events, trends reports, and professional resources related to the pizza and restaurant industry.