Short Story


Problem statement:

Collection and democratization of data in the Cardano ecosystem will be critical for feeding new ideas and measuring key project KPIs.

Describe your solution to the problem

A modified stake pool plus a data pipeline that collects data from two main sources: stake pool blockchain events and GitHub repository commits/activity.

Relevant experience

I’ve been working in the mobile gaming space for 8 years, creating client and server architectures at petabyte scale on both GCP and AWS.

Detailed plan (not required) – Fill in here any additional details

YouTube Proposal Overview:

 

YouTube Proposal BigQuery Examples:

 

Once complete, anyone will be able to ask questions like the one below and get fast, accurate answers using SQL via public BigQuery datasets for their Cardano projects (both before you ship and, in real time, while you operate). You’ll be able to use the BigQuery UI, the Google Sheets BigQuery connector, Tableau, Mode, the Python BigQuery SDK, Jupyter notebooks, and the plethora of other options in the BigQuery and Big Data tool/platform ecosystem.

Q: What are the 10 most popular Cardano assets, by asset type and number of transactions?

A: Queries will be fast and free for most common use cases, but direct access to the BigQuery UI does require a Google account plus a default GCP project (for this phase). Again, you can access the data via other methods, as shown in the YouTube examples video. Teams that use the public BigQuery data significantly more can also pay Google, via their own GCP project, $5 USD per TB scanned. This will be available anywhere BigQuery is available via this interface (again, for this phase).

————————————————————————————————————-

SELECT assets.address,
       assets.type,
       COUNT(1) AS tx_count
FROM `crypto-data-prod.crypto_cardano.assets` AS assets
JOIN `bigquery-public-data.crypto_cardano.transactions` AS transactions
  ON (transactions.to_address = assets.address)
WHERE assets.type IN ('x', 'y', 'z')
GROUP BY 1, 2
ORDER BY tx_count DESC
LIMIT 10

————————————————————————————————————-
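For those who prefer the Python BigQuery SDK mentioned above, here is a minimal sketch of running the same query programmatically. The table names and the 'x'/'y'/'z' type filters are placeholders carried over from the example query, not a live dataset, and the sketch assumes the google-cloud-bigquery package is installed with default GCP credentials configured.

```python
# Sketch: running the "top assets" query via the Python BigQuery SDK.
# Table names and asset-type filters are illustrative placeholders
# from the proposal's example query, not a live public dataset.

TOP_ASSETS_SQL = """
SELECT assets.address,
       assets.type,
       COUNT(1) AS tx_count
FROM `crypto-data-prod.crypto_cardano.assets` AS assets
JOIN `bigquery-public-data.crypto_cardano.transactions` AS transactions
  ON transactions.to_address = assets.address
WHERE assets.type IN ('x', 'y', 'z')
GROUP BY 1, 2
ORDER BY tx_count DESC
LIMIT 10
"""

def top_assets(project_id):
    """Run the query and return the result rows as a list of dicts.

    Assumes google-cloud-bigquery is installed and application-default
    credentials are available for the given billing project.
    """
    from google.cloud import bigquery  # deferred import: optional dependency

    client = bigquery.Client(project=project_id)
    return [dict(row) for row in client.query(TOP_ASSETS_SQL).result()]
```

The same pattern works from a Jupyter notebook or any Python service; only the billing project changes.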

Q: But how much will it cost me to use BigQuery?

A: BigQuery is free for the first 1 TB scanned per month per project/account; after that it's $5 USD per TB scanned. The entire current raw Cardano chain (mainnet) is about 7 GB of data. Raw tables aren't always the most useful way to explore the data, however, so additional summary and aggregate tables will also be created and loaded (streaming and/or batch), and those will likely be much smaller. In my experience, BigQuery is a solid data warehouse that scales in both storage and compute at a reasonable cost to maintain, and I expect the Cardano community will appreciate simple, cost-effective approaches to these problems.
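The pricing arithmetic above can be captured in a tiny helper. The rates are as quoted in this proposal; GCP pricing can change, so treat these constants as a snapshot, not a guarantee.

```python
# On-demand query pricing as described above: the first 1 TB scanned
# per month per project is free, then $5 USD per TB. Rates are as
# quoted in the proposal; check current GCP pricing before relying
# on them.

FREE_TB_PER_MONTH = 1.0
USD_PER_TB = 5.0

def monthly_query_cost(tb_scanned: float) -> float:
    """Estimated USD cost of one project's query scans in a month."""
    billable_tb = max(0.0, tb_scanned - FREE_TB_PER_MONTH)
    return billable_tb * USD_PER_TB

# The ~7 GB raw chain fits comfortably inside the free tier, while a
# heavy team scanning 50 TB/month would pay (50 - 1) * $5 = $245.
```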

Additionally, it appears Kaggle allows direct public access to the BigQuery data at no cost (still to be confirmed). See my simple example: https://www.kaggle.com/chbourkefloydiv/cardano-bigquery-simple

Q: OK, interesting, but this sounds complicated. How will it work?

A: Well, great news! IOHK already has a model for syncing the blockchain to a SQL database: https://github.com/input-output-hk/cardano-db-sync .

Initially, I'll investigate whether a Haskell-only solution works, much like the one the cardano-db-sync repo uses. However, I suspect that approach will not scale to many target databases/formats or allow for a clean implementation. So I'll likely end up using the serialization library to load the data from disk into Python, which will let the Cardano ETL project target just about any destination (BigQuery, Athena, JSON, …). This is more or less the Ethereum ETL approach: https://github.com/blockchain-etl/ethereum-etl . Worst case, I'll need to improve or add features to the serialization library to enable this secondary approach.
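As a rough sketch of what the Python side of that ETL could look like, here is the export step that turns already-decoded block records into newline-delimited JSON, the format BigQuery load jobs accept. The field names follow the example `blocks` schema in this proposal and are assumptions for illustration, not the actual cardano-db-sync or serialization-library format.

```python
import json

# Sketch of the Cardano ETL export step: map decoded block records
# onto the target `blocks` table columns and serialize them as
# newline-delimited JSON for a BigQuery load job. Field names are
# assumptions taken from the example schema in this proposal.

def block_to_row(block: dict) -> dict:
    """Keep only the columns of the target `blocks` table."""
    return {
        "timestamp": block["timestamp"],
        "number": block["number"],
        "hash": block["hash"],
        "parent_hash": block["parent_hash"],
    }

def to_ndjson(blocks) -> str:
    """Serialize blocks as newline-delimited JSON (one row per line)."""
    return "\n".join(json.dumps(block_to_row(b)) for b in blocks)
```

The same mapping layer could just as easily emit rows for Athena or plain JSON files, which is the multi-target flexibility described above.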

All of this will be open source, and I welcome contributions and feedback along the way. At the end of the day, what I would really like is an analytics API built into Cardano that could have multiple listeners reporting to multiple databases or logs. So I'll drive in that direction with feedback from IOHK and the community teams, and potentially land a direct contribution back into Cardano master. #LifeGoals

So that would solve the blockchain/metadata sync to BigQuery, but I would also like to start collecting and sharing other data about the Cardano ecosystem in BigQuery. I'll explore this as a stretch goal, to be included as a separate public BigQuery dataset. The main reason I think this is useful: it helps the larger community see how genuinely inspiring the Cardano development community has been, and shares their accomplishments in a way that more data-oriented individuals can absorb and put to use.

Q: What does this have to do with the “Developer Ecosystem” impact / campaign?

A: Democratization of blockchain data will inspire and drive Cardano adoption because it provides insight into the real-time operation of Cardano AND new perspectives on what is possible with Cardano. There are plenty of data scientists, data engineers, and developers who are interested in blockchains but are more comfortable in their existing Big Data infrastructure.

This is even more true for games, which are largely data-driven projects.

Intent paired with insights will be a growth engine for Cardano. We will need to separate fact from fiction along this road and build the foundation of a data-driven community to really unlock our true potential. We will need to do this both before projects kick off, in some cases, and while they are in operation, with health checks and other useful KPIs.

Q: What will the UI look like?

A: See the YouTube examples video above or the comments below for a BigQuery UI example. Technically, you can use just about any tool to query BigQuery, so it's up to you (Tableau, Python, …).

Q: What are example tables / schemas?

A:

Example tables:

  • balances
  • blocks
  • contracts
  • logs
  • assets
  • transactions

Example table schema: blocks

  • timestamp
  • number
  • hash
  • parent_hash
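For illustration, the example `blocks` columns above could be expressed in BigQuery's JSON schema format (the shape accepted by `bq load --schema` or a load-job configuration). The types and modes here are assumptions, not a finalized schema.

```python
# The example `blocks` columns expressed in BigQuery's JSON schema
# format, as accepted by `bq load --schema` or a load-job config.
# Field types and modes are assumptions for illustration only.
BLOCKS_SCHEMA = [
    {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "number", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "hash", "type": "STRING", "mode": "REQUIRED"},
    {"name": "parent_hash", "type": "STRING", "mode": "NULLABLE"},
]
```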

Q: It seems like you would need funding to maintain these additional nodes / Airflow servers. How will operations and ongoing development work?

A:

I will also strongly consider hosting my own stake pool that could be used to fund the public data operations and innovations, continuing the mission: democratization of data in the Cardano ecosystem. I welcome other approaches as well if anyone has ideas. I estimate roughly $8K USD per 6 months at the current, fairly high usage.

Q: How much funds are being requested and for what?

A:

20,000 USD Cardano ETL Development

  • Third-party subscriptions / tools
  • ~200 hours over 3 months to get to alpha/beta (open source on GitHub)
  • ~200 hours over the following 6 months for production, monitoring, and operations (GCP Kubernetes)

8,000 USD for 6 months of GCP usage costs for operations

  • ~30% ETL compute / networking / GKE managed Kubernetes / Stackdriver / …
  • ~70% BigQuery storage and queries (with buffer for 50 TB scanned per month)

2,000 USD BigQuery dataset marketing with use cases

 

—–

 

I'm super interested in what the community thinks about all of these ideas, so please bring all the critical feedback you have on any of the items I listed (and additional thoughts, of course!). This community really has been humbling to watch so far. Thank you for your time!

Requested funds in USD. Only use numbers! No symbols, letters, fractions. Payment will be in ada. 30000

Which of these definitions apply to you? Developer, Entrepreneur

Want to register as a community advisor? Confirm all following statements are true: I want to serve as a community advisor. I did not submit a funding proposal for Fund3. I am not affiliated with any proposing team in Fund3. I commit to provide fair and thoughtful reviews. I confirm, all of the above statements are true.

Cardano ETL: Public BigQuery Data
