Love.Law.Robots. by Ang Hou Fu

scrapy


I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I got my copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases like the Supreme Court, the State Court, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with PDPC first. Nevertheless, someone might find it helpful so if there is an interest, please let me know!

What has happened since then?

For one, personal data protection commission decisions are not interesting enough for me. Since working on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is much more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of whacked-up coding made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve decided on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A JavaScript framework like Next.js would be more effective for developing websites.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.
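To illustrate the pdfminer side of that comparison, here is a minimal sketch using pdfminer.six’s high-level API. The filename is a placeholder, not a file from my project:

from pdfminer.high_level import extract_text

# A minimal sketch of the pdfminer route for a digital (text-based) PDF.
# "decision.pdf" is a placeholder filename.
text = extract_text("decision.pdf")
print(text[:500])  # preview the first 500 characters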

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr is not being developed anymore, my project can slow down as well). I think this is not so bad because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from making one single package of code to thinking about the work as a process. There are discrete parts to the work, and each piece of code is suited to a particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy to follow for me!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides using the right tool for the job, it is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF, but you don’t need to do that for a digital PDF. If the workflow is a pipeline, you can take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse across several pipelines, you save a lot of coding time.

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue


I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though it took me quite some time.

On the other hand, my leisurely pace also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I complete thinking through the issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping



If you spent long enough coding, you would meet this term: refactoring. The Agile Alliance defines it as “improving the internal structure of an existing program’s source code, while preserving its external behavior”. To paint a picture, it's like tending your garden. Get rid of some leaves, trim the hedges, and maybe add some accessories. It's still a garden, but it's further away from ruin.

In real life, I don't have a garden, and I also hate gardening. It's not the dirt, it's the work.

Similarly, I am also averse to refactoring. The fun is bringing your idea to life and figuring out the means to get there. Improving the work? I will do that some other day.

Lately, I have had the chance to revisit some work. In my latest post, I transform my pdpc-decisions work to scrapy. It's something I put off for nearly a year because I was not looking forward to learning a new framework to do something I had already accomplished.

Take your web scraping to a new level: Let’s play with scrapy – Changing my code to scrapy, a web scraping framework for Python, was challenging but reaped many dividends. (Please don’t be too put off by the cute spider picture.)

In the end, the procrastination didn't make sense. I was surprised I completed the main body of code within a few days. It turned out that my previous experience writing my web scraper helped me to understand the scrapy framework better.

On the other hand, revisiting my old code made me realise how anachronistic my old programming habits were. The programmer in me in 2020 was much different than I am now. The code I would like to write now should get the job done and be easy to read and maintain.

I reckon that in many ways, wanting the code to be so perfect that I could leave it out of my mind forever grew from my foundation as a lawyer. Filings, once submitted, can’t be recalled. Contracts, once signed, are frozen in their time.

My experience with technology made this way of thinking seem obsolete. Our products are moulded by our circumstances, by what is available at the time. As things change, the story doesn't end; it's only delineated in chapters. The truth is that there will always be another filing, and a contract can always be amended.

I reckon that lawyers shouldn't be stuck in their old ways, and we should actively consider how to improve the way we work. As time goes by, what we have worked on becomes forgotten because it's hard to read and maintain. I think we owe it to society to ensure that our skills and knowledge are not forgotten, or at least ensure that the next person doesn't need to walk the same path repeatedly.

As I look into what else in my code needs refactoring, I think: does the code need to be changed because the circumstances have changed, or because I have changed? Sometimes, I am not sure. Honestly. 🤔

Data Science with Judgement Data – My PDPC Decisions Journey: An interesting experiment to apply what I learnt in Data Science to the area of law. Here’s a target for more refactoring!

#Newsletter #Lawyers #scrapy #WebScraping #Programming




Key takeaways:

  • Web scraping is a useful and unique project that is good for beginners.
  • Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing the ability to reuse code and features that are useful for web scraping.
  • Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There’s a lot of information out there on the world wide web, but very little of it is presented in a pretty interface that allows you to just take it. By scraping information off websites, you get structured data. It’s a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

Automate Boring Stuff: Get Python and your Web Browser to download your judgements – This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing! Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote…

After spending gobs of time plying through books and web articles, I created a web scraper that did the job right.

GitHub – houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore. The code repository of the original web scraper is available on GitHub.

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age is screaming at you when it recommends that you install it in a “virtual environment” (who installs anything in Python these days except in a virtual environment?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial

This reminds me of Angular and the ng command. (Do people still do that these days?)
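For orientation, the startproject command lays out a skeleton that looks roughly like this (scrapy’s default template):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders live here
            __init__.py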

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above, but the spider, which does the actual web crawling, has to be created separately within the project. If you jumped straight into coding, you would not find this intuitive.

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest


class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1"
        }

        response = requests.post(CASE_LISTING_URL, data=default_form_data)

        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with requests and yields items, the standard data format in scrapy.

def parse(self, response, **kwargs):
    response_json = response.json()
    for item in response_json["items"]:
        from datetime import datetime
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"https://www.pdpc.gov.sg{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision
        )

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because it already implements multithreading, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines process an item after the spider has yielded it; they cover the work that doesn’t involve generating new items. The most common pipeline component checks whether an item is a duplicate and drops it if it is.
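To make that concrete, here is a minimal duplicates filter, closely modelled on the example in scrapy’s documentation. The "id" field is a stand-in for whatever uniquely identifies your item:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Drop any item whose id has already been seen during this crawl.
        adapter = ItemAdapter(item)
        if adapter["id"] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(adapter["id"])
        return item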

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program implement multithreading and asynchronous operations effectively.

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"

        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also the item has a field called file_urls. I did not create this data field. It’s a field used to tell scrapy to download a file from the web.
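For context, the item class itself isn’t shown above. Based on the fields used in the spider and the pipeline, a sketch of the declaration might look like this (the actual code in my project may differ slightly):

import scrapy


class CommissionDecisionItem(scrapy.Item):
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()
    respondent = scrapy.Field()
    summary = scrapy.Field()
    decision_url = scrapy.Field()
    # file_urls is read by scrapy's FilesPipeline; files is where it
    # records the details of each downloaded file.
    file_urls = scrapy.Field()
    files = scrapy.Field()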

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. Given a priority of 300, the CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.
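If I recall correctly, the files pipeline also needs to be told where to put the downloads, which is one more line in the settings file (the folder name here is only an example):

FILES_STORE = 'downloads'  # a local folder; an s3:// URI also works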

Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their seniority in a settings file isn’t very intuitive if you’re not sure what’s going on. Even so, I am grateful that I did not have to write my own files pipeline.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting to define the seniority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, though, the number of settings isn’t user-friendly. Still, it hints at the number of things you can do with scrapy, including randomising the delay before downloading files from the same site, logging and settings for storage adapters to common backends like AWS or FTP.
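To give a flavour of the settings I’m referring to, here is an illustrative snippet. The setting names are scrapy’s own; the values and the bucket are made up:

# settings.py (illustrative values only)
DOWNLOAD_DELAY = 2               # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's responsiveness
LOG_LEVEL = 'INFO'
FEEDS = {
    's3://my-bucket/decisions.csv': {'format': 'csv'},  # hypothetical bucket
}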

As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.
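As an illustration, scheduling a run on a scrapyd instance is a single HTTP call to its schedule.json endpoint. The sketch below assumes scrapyd is running locally on its default port and that the project has been deployed under the name pdpcSpider:

import requests

# Ask a local scrapyd instance to schedule a crawl.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "pdpcSpider", "spider": "PDPCCommissionDecisions"},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}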

There are lots to explore here!

Conclusion

Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate the special considerations involved in web scraping and how scrapy helps to address them.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy affords power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Data Science with Judgement Data – My PDPC Decisions Journey: An interesting experiment to apply what I learnt in Data Science to the area of law. Read more interesting adventures.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy


I have been mulling over developing an extensive online database of free legal materials in the flavour of OpenLawNZ or an LII for the longest time. Free access to such materials is one problem to solve, but I'm also hoping to compile a dataset to develop AI solutions. I have tried and demonstrated this with PDPC's data previously, and I am itching to expand the project sustainably.

However, being a lawyer, I am concerned about the legal implications of scraping government websites. Would using these materials be a breach of copyright law? In other countries, people accept that the public should generally be allowed to use such public materials. However, I am not very sure of this here.


I was thus genuinely excited about the amendments to the Copyright Act in Singapore this year. According to the press release, they will be operational in November, so they will be here soon.

Copyright Bill – Singapore Statutes Online (provided by the Legislation Division of the Singapore Attorney-General’s Chambers). The Copyright Bill is expected to be operationalised in November 2021.

[ Update 21 November 2021: The bill has, for the most part, been operationalised.]

Two amendments are particularly relevant in my context:

Using publicly disclosed materials from the government is allowed

In sections 280 to 282 of the Bill, it is now OK to copy or communicate public materials to facilitate more convenient viewing or hearing of the material. It should be noted that this is limited to copying and communicating it. Presumably, this means that I can share the materials I collected on my website as a collection.

Computational data analysis is allowed.

The amendments expressly say that using a computer to extract data from a work is now permitted. This is great! At some level, the extraction of the material is to perform some analysis or computation on it — searching or summarising a decision etc. I think some limits are reasonable, such as not communicating the material itself or using it for any other purpose.

However, one condition stands out for me — I need “lawful access” to the material in the first place. The first illustration to explain this is circumventing paywalls, which isn’t directly relevant to me. The second illustration explains that obtaining the materials through a breach of the terms of use of a database is not “lawful access”.

That’s a bit iffy. As you will see in the section surveying terms, a website’s terms are not always clear about whether access is lawful or not. The “terms of use” of a website are usually given very little thought by its developers or implemented in a maximal way that is at once off-putting and misleading. Does trying to beat a captcha mean I did not get lawful access? Sure, it’s a barrier to thwart robots, but what does it mean? If a human helps a robot, would it still be lawful?

A recent journal article points to “fair use” as the way forward

I was amazed to find an article in the SAL Journal titled “Copying Right in Copyright Law” by Prof David Tan and Mr Thomas Lee, which focused on the issue that was bothering me. The article focuses on data mining and predictive analytics, and it substantially concerns robots and scrapers.

Singapore Academy of Law Journal (e-First): Link to the journal article on e-First at SAL Journals Online.

On the new exception for computational data analysis, the article argues that the two illustrations I mentioned earlier were “inadequate and there is significant ambiguity of what lawful access means in many situations”. Furthermore, because the illustrations were not illuminating, it might create a situation where justified uses are prohibited. With much sadness, I agree.

More interestingly, based on some mathematics and a survey, the authors argue that an open-ended general fair use defence for data mining is the best way forward. As opposed to a rule-based exception, such a defence can adapt to changes better. Stakeholders (including owners) also prefer it because it appeals to their understanding of the economic basis of data mining.

You can quibble with the survey methodology and the mathematics (which I think is very brave for a law journal article). I guess it served its purpose in showing the opinion of stakeholders in the law and the cost analysis very well. I don’t suspect it will be cited in a court judgement soon, but hopefully, it sways someone influential.

We could use a more developer-friendly approach.


There was a time when web scraping was dangerous for a website. In those days, websites could be inundated with requests from automated robots, leading them to crash. Since then, web infrastructure has improved, and techniques to defeat malicious actors have been developed. The days of “slashdotting” a website haven’t been heard of for a while. We’ve mostly migrated to more resilient infrastructure, and any serious website on the internet understands the value of having such infrastructure.

In any case, it is possible to scrape responsibly. Scrapy, for example, allows you to space out your requests, identify yourself as a robot or scraper, and respect a site’s robots.txt. If I agree not to degrade a website’s performance, which seems quite reasonable, shouldn’t I be allowed to use it?
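As a sketch of what “scraping responsibly” can look like in a scrapy settings file (the values are illustrative and the contact address is a placeholder):

# settings.py: being a polite scraper
ROBOTSTXT_OBEY = True        # respect the site's robots.txt
USER_AGENT = 'zeeker scraper (contact: hello@example.com)'  # identify yourself
DOWNLOAD_DELAY = 5           # space out requests
AUTOTHROTTLE_ENABLED = True  # back off if the server slows down
CONCURRENT_REQUESTS_PER_DOMAIN = 1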

Being more developer-friendly would also help government agencies find more uses for their works. For now, most legal resources appear to cater exclusively for lawyers. Lawyers will, of course, find them most valuable because it’s part of their job. However, others may also need such resources because they can’t afford lawyers or have a different perspective on how information can be helpful. It’s not easy catering to a broader or other audience. If a government agency doesn’t have the resources to make something more useful, shouldn’t someone else have a go? Everyone benefits.

Surveying the terms of use of government websites


Since “lawful access” and, by extension, the “terms of use” of a website will be important in considering the computational data analysis exceptions, I decided to survey the terms of use of various government agencies. After locating their treatment of the intellectual property rights in their materials, I gauged my appetite for extracting them.

In all, I identified three broad categories of terms.

Totally Progressive: Singapore Statutes Online 👍👍👍

Source: https://sso.agc.gov.sg/Help/FAQ#FAQ_8 (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “automated means”. It looks like they were prepared for robots!
  • Conditions appear reasonable. There’s a window for extraction and guidelines to help properly cite and identify the extracted materials.

Things I don’t like:

  • The Singapore Statutes Online website is painful to extract from and doesn’t feature any API.

Comments:

  • Knowing what they expect scrapers to do gives me confidence in further exploring this resource.
  • Maybe the key reason these terms of use are excellent is that they apply to a specific resource. If a resource owner wants to make things developer-friendly, they should consider their collections and specify their terms of use.

Totally Bonkers: Personal Data Protection Commission 😖😖😖

Source: https://www.pdpc.gov.sg/Terms-and-Conditions (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “robots” and “spiders”. It looks like they were prepared!

Things I don’t like:

  • It doesn’t allow you to use a “manual process” to monitor its Contents. You can’t visit our website to see if we have any updates!
  • What is an automatic device? Like a feed reader? (Fun fact: The PDPC obliterated their news feed in the latest update to their website. The best way to keep track of their activities is to follow their LinkedIn)
  • PDPC suggests that you get written permission but doesn’t tell you what circumstances they will give you such permission.
  • I have no idea what an unreasonable or disproportionately large load is. It looks like I have to crash the server to find out! (Just kidding, I will not do that, OK.)

Comments:

  • I have no idea what happened to the PDPC, such that it had to impose such unreasonable conditions on this activity (I hope I am not involved in any way 😇). It might be possible that someone with little knowledge went a long way.
  • At around paragraph 6, there is a somewhat complex set of terms allowing a visitor to share and use the contents of the PDPC website for non-commercial purposes. This, however, still does not gel with paragraph 20, and the confusion is not user- or developer-friendly, to say the least.
  • You can’t contract out fair use or the computational data analysis exception, so forget it.
  • I’m a bit miffed when I encounter such terms. Let’s hope their technical infrastructure is as well thought out as their terms of use. (I’m being ironic.)

Totally Clueless: Strata Titles Board 🎈🎈🎈

Materials, including source code, pages, documents and online graphics, audio and video in The Website are protected by law. The intellectual property rights in the materials is owned by or licensed to us. All rights reserved. (Government of Singapore © 2006).
Apart from any fair dealings for the purposes of private study, research, criticism or review, as permitted in law, no part of The Website may be reproduced or reused for any commercial purposes whatsoever without our prior written permission.

Source: https://www.stratatb.gov.sg/terms-of-use.html# (Accessed 20 October 2021)

Things I like:

  • Mentions fair dealing as permitted by law. However, they have to update to “fair use” or “permitted use” once the new Copyright Act is effective.

Things I don’t like:

  • Not sure why it says “Government of Singapore ©️ 2006”. Maybe they copied this terms of use statement in 2006 and never updated it since?
  • You can use the information for “commercial purposes” if you get written permission. It doesn’t tell you in what circumstances they will give you such permission. (This is less upsetting than PDPC’s terms.)
  • It doesn’t mention robots, spiders or “automatic devices”.

Comments:

  • It’s less upsetting than a bonkers terms of use, but it doesn’t give me confidence or an idea of what to expect.
  • The owner probably has no idea what data mining, predictive analytics etc., are. They need to buy the new “Law and Technology” book.

Conclusion

One might be surprised to find that the terms of use of a website, even when supposedly managed by lawyers, can be unclear, problematic, misleading and unreasonable. As I mentioned, very little thought goes into drafting such terms most of the time. However, they pose obstacles to others who may want to explore new uses of a website or resource. Hopefully, more owners will proactively clean up their sites once the new Copyright Act becomes effective. In the meantime, this area carries a lot of risk for a developer.

#Law #tech #Copyright #DataScience #Government #WebScraping #scrapy #Singapore #PersonalDataProtectionCommission #StrataTitlesBoard #DataMining
