Data scientists working in Python or R typically acquire data by way of REST APIs. Both environments provide libraries that help you make HTTP calls to REST endpoints, then transform JSON responses into dataframes. But that’s never as simple as we’d like. When you’re reading a lot of data from a REST API, you need to do it a page at a time, but pagination works differently from one API to the next. So does unpacking the resulting JSON structures. HTTP and JSON are low-level standards, and REST is a loosely-defined framework, but nothing guarantees absolute simplicity, never mind consistency across APIs.
What if there were a way of reading from APIs that abstracted all the low-level grunt work and worked the same way everywhere? Good news! That is exactly what Steampipe does. It’s a tool that translates REST API calls directly into SQL tables. Here are three examples of questions that you can ask and answer using Steampipe.
1. Twitter: What are recent tweets that mention PySpark?
Here’s a SQL query to ask that question:
select id, text
from twitter_search_recent
where query = 'pyspark'
order by created_at desc
limit 5;
Here’s the answer:
+---------------------+------------------------------------------------------------------------------------------------>
| id | text >
+---------------------+------------------------------------------------------------------------------------------------>
| 1526351943249154050 | @dump Tenho trabalhando bastante com Spark, mas especificamente o PySpark. Vale a pena usar um >
| 1526336147856687105 | RT @MitchellvRijkom: PySpark Tip ⚡ >
| | >
| | When to use what StorageLevel for Cache / Persist? >
| | >
| | StorageLevel decides how and where data should be s… >
| 1526322757880848385 | Solve challenges and exceed expectations with a career as a AWS Pyspark Engineer. https://t.co/>
| 1526318637485010944 | RT @JosMiguelMoya1: #pyspark #spark #BigData curso completo de Python y Spark con PySpark >
| | >
| | https://t.co/qf0gIvNmyx >
| 1526318107228524545 | RT @money_personal: PySpark & AWS: Master Big Data With PySpark and AWS >
| | #ApacheSpark #AWSDatabases #BigData #PySpark #100DaysofCode >
| | -> http… >
+---------------------+------------------------------------------------------------------------------------------------>
The table that’s being queried here, twitter_search_recent, receives the output from Twitter’s /2/tweets/search/recent endpoint and formulates it as a table with these columns. You don’t have to make an HTTP call to that API endpoint or unpack the results, you just write a SQL query that refers to the documented columns. One of those columns, query
, is special: it encapsulates Twitter’s query syntax. Here, we are just looking for tweets that match PySpark but we could as easily refine the query by pinning it to specific users, URLs, types (is:retweet
, is:reply
), properties (has:mentions
, has_media
), etc. That query syntax is the same no matter how you’re accessing the API: from Python, from R, or from Steampipe. It’s plenty to think about, and all you should really need to know when crafting queries to mine Twitter data.
2. GitHub: What are repositories that mention PySpark?
Here’s a SQL query to ask that question:
select name, owner_login, stargazers_count from github_search_repository where query = 'pyspark' order by stargazers_count desc limit 10;
Here’s the answer:
+----------------------+-------------------+------------------+
| name | owner_login | stargazers_count |
+----------------------+-------------------+------------------+
| SynapseML | microsoft | 3297 |
| spark-nlp | JohnSnowLabs | 2725 |
| incubator-linkis | apache | 2524 |
| ibis | ibis-project | 1805 |
| spark-py-notebooks | jadianes | 1455 |
| petastorm | uber | 1423 |
| awesome-spark | awesome-spark | 1314 |
| sparkit-learn | lensacom | 1124 |
| sparkmagic | jupyter-incubator | 1121 |
| data-algorithms-book | mahmoudparsian | 1001 |
+----------------------+-------------------+------------------+
This looks very similar to the first example! In this case, the table that’s being queried, twitter_search_repository, receives the output from GitHub’s /search/repositories endpoint and formulates it as a table with these columns.
In both cases the Steampipe documentation not only shows you the schemas that govern the mapped tables, it also gives examples (Twitter, GitHub) of SQL queries that use the tables in various ways.
Note that these are just two of many available tables. The Twitter API is mapped to 7 tables, and the GitHub API is mapped to 37 tables.
3. Twitter + GitHub: What have owners of PySpark-related repositories tweeted lately?
To answer this question we need to consult two different APIs, then join their results. That’s even harder to do, in a consistent way, when you’re reasoning over REST payloads in Python or R. But this is the kind of thing SQL was born to do. Here’s one way to ask the question in SQL.
-- find pyspark repos
with github_repos as ( select name, owner_login, stargazers_count from github_search_repository where query = 'pyspark' and name ~ 'pyspark' order by stargazers_count desc limit 50
), -- find twitter handles of repo owners
github_users as ( select u.login, u.twitter_username from github_user u join github_repos r on r.owner_login = u.login where u.twitter_username is not null
), -- find corresponding twitter users select id from twitter_user t join github_users g on t.username = g.twitter_username
) -- find tweets from those users
select t.author->>'username' as twitter_user, 'https://twitter.com/' || (t.author->>'username') || '/status/' || t.id as url, t.text
from twitter_user_tweet t
join twitter_userids u
on t.user_id = u.id
where t.created_at > now()::date - interval '1 week'
order by t.author
limit 5
Here is the answer:
+----------------+---------------------------------------------------------------+------------------------------------->
| twitter_user | url | text >
+----------------+---------------------------------------------------------------+------------------------------------->
| idealoTech | https://twitter.com/idealoTech/status/1524688985649516544 | Are you able to find creative soluti>
| | | >
| | | Join our @codility Order #API Challe>
| | | >
| | | #idealolife #codility #php >
| idealoTech | https://twitter.com/idealoTech/status/1526127469706854403 | Our #ProductDiscovery team at idealo>
| | | >
| | | Think you can solve it? 😎 >
| | | ➡️ https://t.co/ELfUfp94vB https://t>
| ioannides_alex | https://twitter.com/ioannides_alex/status/1525049398811574272 | RT @scikit_learn: scikit-learn 1.1 i>
| | | What's new? You can check the releas>
| | | >
| | | pip install -U… >
| andfanilo | https://twitter.com/andfanilo/status/1524999923665711104 | @edelynn_belle Thanks! Sometimes it >
| andfanilo | https://twitter.com/andfanilo/status/1523676489081712640 | @juliafmorgado Good luck on the reco>
| | | >
| | | My advice: power through it + a dead>
| | | >
| | | I hated my first few short videos bu>
| | | >
| | | Looking forward to the video 🙂
When APIs frictionlessly become tables, you can devote your full attention to reasoning over the abstractions represented by those APIs. Larry Wall, the creator of Perl, famously said: “Easy things should be easy, hard things should be possible.” The first two examples are things that should be, and are, easy: each is just 10 lines of simple, straight-ahead SQL that requires no wizardry at all.
The third example is a harder thing. It would be hard in any programming language. But SQL makes it possible in several nice ways. The solution is made of four concise stanzas (CTEs, Common Table Expressions) that form a pipeline. Each phase of the pipeline handles one clearly-defined piece of the problem. You can validate the output of each phase before proceeding to the next. And you can do all this with the most mature and widely-used grammar for selection, filtering, and recombination of data.
Do I have to use SQL?
No! If you like the idea of mapping APIs to tables, but you would rather reason over those tables in Python or R dataframes, then Steampipe can oblige. Under the covers it’s Postgres, enhanced with foreign data wrappers that handle the API-to-table transformation. Anything that can connect to Postgres can connect to Steampipe, including SQL drivers like Python’s psycopg2
and R’s RPostgres
as well as business-intelligence tools like Metabase, Tableau, and PowerBI. So you can use Steampipe to frictionlessly consume APIs into dataframes, then reason over the data in Python or R.
But if you haven’t used SQL in this way before, it’s worth a look. Consider this comparison of SQL to Pandas from How to rewrite your SQL queries in Pandas.
SQL | Pandas |
---|---|
select * from airports | airports |
select * from airports limit 3 | airports.head(3) |
select id from airports where ident = ‘KLAX’ | airports[airports.ident == ‘KLAX’].id |
select distinct type from airport | airports.type.unique() |
select * from airports where iso_region = ‘US-CA’ and type = ‘seaplane_base’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type == ‘seaplane_base’)] |
select ident, name, municipality from airports where iso_region = ‘US-CA’ and type = ‘large_airport’ | airports[(airports.iso_region == ‘US-CA’) & (airports.type == ‘large_airport’)][[‘ident’, ‘name’, ‘municipality’]] |
We can argue the merits of one style versus the other, but there’s no question that SQL is the most universal and widely-implemented way to express these operations on data. So no, you don’t have to use SQL to its fullest potential in order to benefit from Steampipe. But you might find that you want to.
Source of Article