HiPEDS Seminar - Web Data Extraction: A Crash Course

If you have a question about this talk, please contact Wiesia R Hsissen.

Abstract: Data acquisition plays an important role in modern organisations and is a strategic business process for data-driven companies such as insurers, retailers, and search engines. Data acquisition processes range from manual data collection and purchase to cheaper but often technically challenging methods such as automated collection and crowdsourcing. The abundance of web data has made web scraping (also known as web data extraction or web wrapping) an essential tool in data acquisition processes. A wrapper is a program that turns web content into structured data, using techniques ranging from visual analysis of the rendered page to DOM tree mining. Web scraping is often the only viable data collection method for websites, in particular when no API is available. Although web scraping has typically relied on inducing a wrapper for every source, a number of semi- or fully automated techniques have emerged. These recent advances have finally allowed for accurate and fully automated wrapper induction at the scale of hundreds of thousands of sources. They have also helped to revitalise the area, as is evident from the growing number of web scraping startups, e.g., Import.io, DiffBot, ScrapingHub, and Wrapidity.
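To make the notion of a wrapper concrete, the following is a minimal sketch of a hand-written one, not a technique taken from the talk. It uses only Python's standard `html.parser` module to turn a page fragment into a list of records. The sample markup, the class names `product`, `name` and `price`, and the names `ProductWrapper` and `extract_products` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and class names are assumptions.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">24.50</span></li>
</ul>
"""


class ProductWrapper(HTMLParser):
    """Walks the tag stream and emits one dict per <li class="product">."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being built, if any
        self._field = None    # field name of the <span> we are inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self._current = {}
        elif tag == "span" and self._current is not None and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than overwrite.
        if self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_products(html):
    """Run the wrapper over an HTML string and return the extracted records."""
    parser = ProductWrapper()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_products(SAMPLE_HTML)
for record in records:
    print(record)
```

A wrapper like this is tied to the markup of a single source and stops working when the page layout changes, which is why writing or inducing one per source becomes costly at scale.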

Bio: Giorgio Orsi is a Senior Research Scientist at Meltwater and an Honorary Researcher at the School of Computer Science of the University of Birmingham. His research deals with the algorithmic aspects of large-scale data processing and with the logical foundations of information integration and knowledge representation. Giorgio is a co-investigator of the EPSRC Programme Grant VADA (Value Added Data Systems) and a co-founder of Wrapidity, an Oxford University startup that was recently acquired by Meltwater to boost the collection of outside data using AI.

This lecture is a crash course in web scraping. We will start with an overview of the available techniques and technologies, discussing when and where they are appropriate. We will then introduce the open-source OXPath language for declarative web scraping.

This talk is part of the Featured talks series.
