Love.Law.Robots. by Ang Hou Fu

I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases like the Supreme Court, the State Court, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with PDPC first. Nevertheless, someone might find it helpful so if there is an interest, please let me know!

What has happened since then?

For one, Personal Data Protection Commission decisions are not interesting enough on their own. Since I worked on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is far more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of hacked-together code made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Instead of writing all the code myself (which takes time and lots of motivation), I could “piggyback” on the efforts of others to create a better stack. The stack I’ve settled on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A JavaScript framework like Next.js would be more effective for developing websites.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted Docker container like parsr to OCR or analyse the PDF and then convert it to text.

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr is not being developed anymore, my project can slow down as well). I think this is not so bad because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from making one single package of code to thinking about the work as a process. There are discrete parts to a task, and each piece of code is suited to its particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy for me to follow!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides using the right tool for the job, it is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF, but you don’t need to do that for a digital PDF. If the workflow is a pipeline, you can take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse over several pipelines, you save a lot of coding time.
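To make this concrete, here is a minimal sketch of what I mean by a pipeline of small, reusable steps. The function names and the steps themselves are hypothetical placeholders for illustration, not actual zeeker code:

from typing import Callable, Dict, List

# A pipeline step is a small function that takes a dict describing one
# document and returns an updated dict.
Step = Callable[[Dict], Dict]

def download(doc: Dict) -> Dict:
    # Standard-fare step: pretend to fetch the file at doc["url"] and record
    # where it was saved.
    doc["local_path"] = f"downloads/{doc['id']}.pdf"
    return doc

def ocr(doc: Dict) -> Dict:
    # Only needed for scanned PDFs; a digital-PDF pipeline simply omits this step.
    doc["text"] = "(text recovered by OCR)"
    return doc

def extract_text(doc: Dict) -> Dict:
    # For digital PDFs, read the text layer directly.
    doc["text"] = "(text read from the PDF's text layer)"
    return doc

def run_pipeline(doc: Dict, steps: List[Step]) -> Dict:
    for step in steps:
        doc = step(doc)
    return doc

# Chop and change: reuse `download` everywhere, swap `ocr` in or out as needed.
scanned_pdf_pipeline = [download, ocr]
digital_pdf_pipeline = [download, extract_text]
result = run_pipeline({"id": "abc", "url": "https://example.com/abc.pdf"}, scanned_pdf_pipeline)

Each step only needs the dictionary it receives, so swapping a step in or out doesn’t disturb the rest of the pipeline.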

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue

I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though it has taken me quite some time.

On the other hand, my leisurely pace has also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I have thought through these issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping

Love.Law.Robots. – A blog by Ang Hou Fu

As I continued with my project of dealing with 5 million Monopoly Junior games, I had a problem. Finding a way to play hundreds of games per second was one thing. How was I going to store all my findings?

Limitations of using a CSV File

Initially, I used a CSV (comma-separated values) file to store the results of every game. Using a CSV file is straightforward, as Python and pandas can load one quickly. I could even open it in Excel and edit it with a simple text editor.

However, every time my Streamlit app tried to load a 5GB CSV file as a pandas dataframe, the tiny computer started to gasp and pant. If you try to load a huge file, your application might suffer performance issues or failures. Using a CSV to store a single table is fine. However, once I attempted anything more complicated, like tracking the progress of several games, its limitations became painfully apparent.

Let’s use SQL instead!

The main alternative to a CSV is to store data in a database. Of all the databases out there, SQL is the biggest dog of them all. Many applications, from web applications to games, use some flavour of SQL. Python, a “batteries included” programming language, even ships with a module for SQLite — basically a lightweight SQL database you can store as a single file.
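As a taste of how little is needed, here is a minimal sketch using Python’s built-in sqlite3 module (the file and table names are made up for illustration):

import sqlite3

# Connect to (or create) a database stored as a single file.
con = sqlite3.connect("example.db")
cur = con.cursor()

# Create a table, insert a row, and read it back.
cur.execute("CREATE TABLE IF NOT EXISTS game (id INTEGER PRIMARY KEY, winner INTEGER)")
cur.execute("INSERT INTO game (winner) VALUES (?)", (2,))
con.commit()
print(cur.execute("SELECT id, winner FROM game").fetchall())
con.close()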

A SQL database doesn’t require all its data to be loaded into memory before you can use it. You can also create indexes, which make searches faster and less taxing on the computer. Doable for a little computer!

Most importantly, you use data in a SQL database by querying it. This means I can store all sorts of data in the database without worrying that it will bog me down. During data extraction, I can aim to collect as much data as I can find. When I need a table from the database, I ask it for only the information I want. This makes preparing data quick, and it also makes it possible to explore the data I have.
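For instance, a narrow query can be pulled straight into a pandas dataframe without touching the rest of the database. This is a sketch; the file, table and column names are assumptions that happen to mirror the Game model later in this post:

import sqlite3

import pandas as pd

# Ask the database for just the table I want, instead of loading everything.
con = sqlite3.connect("games.db")
df = pd.read_sql_query(
    "SELECT name, winner, rounds FROM game WHERE num_of_players = 3",
    con,
)
con.close()
print(df.head())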

Why I procrastinated on learning SQL

To use a SQL database, you have to write operations in the SQL language, which looks like a sentence of gobbledygook to the untrained eye.

xkcd, “Exploits of a Mom” (https://imgs.xkcd.com/comics/exploits_of_a_mom.png). I’ll stick to symbols in my children’s names, please, thanks.

SQL is also in the top 10 on the TIOBE index of programming languages. Higher than TypeScript, at the very least.

I had heard that SQL is a bit like Excel formulas. However, I dreaded learning a new language just to dig through a database. The landscape of SQL was also daunting: there are several “flavours” of SQL, and I was not sure of the difference between PostgreSQL, MySQL and MariaDB.

For people who want to stick to their programming language, there are ORM (object-relational mapping) tools, which let you work with a SQL database from your favourite language. For Python, the main ORM is SQLAlchemy. Seeing SQL operations expressed in my everyday programming language was comforting at first, but I found the documentation difficult to follow.
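For a flavour of the ORM approach, here is a minimal SQLAlchemy sketch (the table and columns are invented for illustration and loosely mirror the Game model later in this post):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Game(Base):
    # Each class attribute becomes a column of the "game" table.
    __tablename__ = "game"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    winner = Column(Integer)

engine = create_engine("sqlite:///example.db")
Base.metadata.create_all(engine)

# Insert a row without writing any SQL by hand.
with Session(engine) as session:
    session.add(Game(name="demo-game", winner=1))
    session.commit()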

Furthermore, I found some alternative databases easier to pick up and deploy. One of these was MongoDB, a NoSQL database. It relies on a different concept: data is stored in “documents” instead of tables, and it comes with a full-featured ORM. For many web applications, the “document” idiom applies well. It wouldn’t make sense, though, if I wanted a table.

Enter SQLModel!

A few months ago, an announcement from the author of FastAPI excited me — he was working on SQLModel. SQLModel would be a game-changer if it were anything like FastAPI. FastAPI featured excellent documentation and was so developer-friendly that I didn’t have to worry about the API aspects of my application. If SQLModel could do for SQL what FastAPI did for APIs in Python, that would be great!

SQLModel: SQL databases in Python, designed for simplicity, compatibility, and robustness.

As it turned out, SQLModel was a very gentle introduction to SQL. The following code creates a connection to an SQLite database you would create in your file system.

from typing import Optional

from sqlmodel import SQLModel, create_engine, Session
from sqlalchemy.engine import Engine

engine: Optional[Engine] = None


def create_db_engine(filename: str):
    # Point a global engine at an SQLite file and create any tables
    # registered on SQLModel's metadata.
    global engine
    engine = create_engine(f'sqlite:///{filename}')
    SQLModel.metadata.create_all(engine)
    return engine

Before creating a connection to a database, you may want to define some “models”. Each model gets translated into a table in your SQL database. That way, you work with familiar Python objects in the rest of your code while the SQLModel library takes care of the SQL parts.

The following code defines a model for each game played.

from typing import Optional

from sqlmodel import SQLModel, Field


class Game(SQLModel, table=True):
    # Each field becomes a column in the "game" table.
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field()
    parent: Optional[str] = Field()
    num_of_players: int = Field()
    rounds: Optional[int] = Field()
    turns: Optional[int] = Field()
    winner: Optional[int] = Field()

So, every time the computer finished a Monopoly Junior game, it would store the statistics as a Game. (It’s the line where the result is assigned.)

def play_basic_game(num_of_players: Literal[2, 3, 4], turns: bool) -> Tuple[Game, List[Turn]]:
    if num_of_players == 3:
        game = ThreePlayerGame()
    elif num_of_players == 4:
        game = FourPlayerGame()
    else:
        game = TwoPlayerGame()
    game_id = uuid.uuid4()
    logging.info(f'Game {game_id}: Started a new game.')
    end_turn, game_turns = play_rounds(game, game_id, turns)
    winner = decide_winner(end_turn)
    result = Game(
        name=str(game_id),
        num_of_players=num_of_players,
        rounds=end_turn.get_value(GameData.ROUNDS),
        turns=end_turn.get_value(GameData.TURNS),
        winner=winner.value,
    )
    logging.debug(f'Game {game_id}: {winner} is the winner.')
    return result, game_turns

After playing a bunch of games, these games get written to the database in an SQL session.

def write_games(games: List[Game]):
    # Add each game to the session, then commit them to the database in one go.
    with Session(engine) as session:
        for game in games:
            session.add(game)
        session.commit()

Reading your data from the SQLite file is relatively straightforward as well. If you were writing an API with FastAPI, you could pass the model as a response model and get all the great features of FastAPI directly.

In the following code, I had already stored some Game data in “games.db”. I created a backend server that reads this file and returns all the Turns belonging to a Game. This required me to select all the turns of a game that matched its unique id. (As you might have noticed in the previous section, this is a UUID.)

from typing import List

from fastapi import FastAPI
from sqlmodel import Session, create_engine, select

app = FastAPI()
games_engine = create_engine('sqlite:///games.db')


@app.get("/game/{game_name}", response_model=List[Turn])
def get_turns(game_name: str):
    """
    Get a list of turns from :param game_name.
    """
    with Session(games_engine) as session:
        check_game_exists(session, game_name)
        return session.exec(select(Turn).where(Turn.game == game_name)).all()

Limitations of SQLModel

Of course, being marked as version “0.0.6”, SQLModel is still in the early days of its development. The documentation is already helpful but still a work in progress. A key feature I am looking out for is data migration, most notably across different software versions. This problem can get very complex very quickly, so any help here would be welcome.

I also found creating the initial tables confusing. You create models by subclassing the library’s SQLModel class, make sure those models are imported before the metadata is used, and then create the tables by calling SQLModel.metadata.create_all(engine). This doesn’t seem very Pythonic to me.

Would I continue to use SQLModel?

There are already use cases where SQLModel is applicable. My Monopoly Junior data project is one beneficiary. The library provided a quick, working solution for storing results and data in a SQL database without my needing to get too deep and dirty with SQL. However, if your project is more complicated, such as a web application, you should seriously consider its current limitations.

Even if I might not have the opportunity to use SQLModel in the future, there were benefits from using it now:

  • I became more familiar with the SQL language after seeing it in action and comparing SQL statements with their SQLModel equivalents in the documentation (there’s a short comparison after this list). Now I can write simple SQL statements!
  • SQLModel is described as a combination of SQLAlchemy and Pydantic. Once I realised that many of the features of SQLModel are extensions of SQLAlchemy’s, I was able to read and digest SQLAlchemy’s documentation and library. If I can’t find what I want in SQLModel now, I could look into SQLAlchemy.
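To give a flavour of that comparison, here is roughly the same query written both ways, using the Game model from earlier. This is a sketch for illustration rather than code from the project:

# Plain SQL: the three-player games won by player 0.
#   SELECT * FROM game WHERE num_of_players = 3 AND winner = 0;

# The SQLModel equivalent, reusing the engine and Game model defined earlier.
from sqlmodel import Session, select

with Session(engine) as session:
    games = session.exec(
        select(Game).where(Game.num_of_players == 3, Game.winner == 0)
    ).all()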

Conclusion

Storing my Monopoly Junior games’ game and turn data became straightforward with SQLModel. However, the most significant takeaway from playing with SQLModel was my increased experience with SQL and SQLAlchemy. If you have ever had difficulty getting into SQL, SQLModel can smooth that road. How has your experience with SQLModel been? Feel free to share with me!

#Programming #DataScience #Monopoly #SQL #SQLModel #FastAPI #Python #Streamlit

Love.Law.Robots. – A blog by Ang Hou Fu