Solution for DB Query on Multi DB Servers

In HMS why do we need this tool?

To increase the security as well as efficiency, even though using MySQL as the main DB, we divide them into 6 separate MySQL Servers
Each Server contains one or more Databases
There is always a need to join cross dbs between different servers
Right now, what we did are:
a. Use App APIs to query and then store in a temp table in the same server and then join --> Microserver Best Practice but it takes a lot of devs and codes
b. Extract/Pull all of them to a datalake ( GCP Bigquery for now) and then query from there --> Correct solution for a Big Data Concept but its not match for a real time request and it costly if we want to do that.

So what we need is a way to query from multiple database source and joinable. Morever, if its SQL-like, it will offload a lot of dev tasks and share them with Data Analysts

By query directly on the Source without ETL, we can guarrantee the REALTIME as well as maximize the cost ( coz we can do them ON-DEMAND)

So lets introduce PRESTO

What is PRESTO?

“Presto (or PrestoDB) is an open source, distributed SQL query
engine, designed from the ground up for fast analytic queries against
data of any size.”

It interfaces both non-relational data sources like Amazon S3 and
Hadoop HDFS, MongoDB, and HBase, as well as relational databases like
MySQL, PostgreSQL, and MS SQL Server.

Presto’s power and value proposition is defined by the fact that it
can query data wherever it is stored, without the need to move the
data into a separate, structured system like a relational database or
data warehouse. Finally, query execution runs in parallel over a
scalable “pure memory-based architecture,” returning results in
seconds irrespective of the size of the data being queried.

Presto was first developed by Facebook to run interactive queries on a
300 petabyte (PB) data warehouse, structured as Hadoop clusters. It
was implemented across the entire company in 2013.

https://www.upsolver.com/blog/apache-presto-what-is-it-and-why-you-should-use-it

PRESTO HISTORY and DERIVIATIONs

Presto History

Presto Config for HMS

Super Simple: what you need is declaring 6 connectors for 6 Mysql and put them in the Connector Config

sh-4.2$ lqkhoi@DESKTOP-3UMSCJC:~$ cat vp_team6.properties
connector.name=mysql
connection-url=jdbc:mysql://......cloudhms.io:20206
connection-user=
connection-password=.....
case-insensitive-name-matching=true

Presto Naming

For each of the source, Presto will call it a CATALOG
Therefore the naming for each DB Table will be

CATALOG.SCHEMA.TABLE

For example:

nonvp_team1.prod_pms_rate.rate

Presto Naming Notice

By default, Presto is CASE INSENSITIVE. Therefore, it will not work well with MySQL Case Sensitive Table and Column Names.

Luckily, by 2021, Presto add a workaround support for this limitation.

By adding one line in the Config, Presto will search and compare based in INSENSITIVE Search. Therefore, it will work and adapt to HMS DB Naming which use Camel Case Naming.

case-insensitive-name-matching=true

SELECT h.code as property , apa.ExecuteDate , sum(case when apa.IsBreakfast = 1 then apa.QuantityPerStayingDate end) as breakfast FROM vp_team6.prod_ops_transaction_account.AutomationPostingActivity apa left join vp_team2.prod_pms_property.pro_hotel h on apa.PropertyId = h.id where true and apa.PackageType = 'INCLUDE' and apa.State in (1,5) and apa.ExecuteDate BETWEEN CURRENT_DATE and DATE_ADD('day',60, CURRENT_DATE) and h.code = 'VOPQ' group by 1,2;

Presto Join Example

Presto Client UI and DB

You can use Presto Client CLI to test which is provided as a JAR File

For UI, you can use DataGrip or even SQL Lab in SuperSet BI Tool

For Library, its available in C, Python, Java and Node in native and also via JDBC case

https://prestodb.io/resources.html

enter image description here –> enough to cover HMS

Presto SQL Syntax

It is designed to support standard ANSI SQL semantics, including complex queries, aggregations, joins, left/right outer joins, sub-queries, window functions, distinct counts, and approximate percentiles

Neverthelese, some Date/Time Builtin Function may not support by Presto

https://prestodb.io/docs/current/functions.html

Presto Deploy on Cloud

sudo ln -s /share/CACHEDEV1_DATA/.qpkg/QPython39/bin/python3 /usr/local/bin/python3

I deployed Presto by myself using a simple Docker on a single node.

however, for production it must be much more complicated than that. Normal it will need to deploy on a cluster minimum with 1 master node and 2 worker nodes.

For AWS, lattest EMR service support both Presto and Trino ( fork Presto).

For GCP, you can get Presto on GCP DataProc Service

AWS also provide a fully serverless Presto based called Athena. However, the cost may be a little bit steep ( price is on par with Big Query and calculated based on Scanning: 1TB Scanning for 5 USD).

For us, the easiest way should be using minimum EMR Setup which is 1 master node and 2 worker

PRESTO REAL TEST

Querying on Real VP Production and join the biggest table on team 6 ( 6G rows) with the default property table on team 2.

Running on a Docker simulated running on Windows WSL2 8GB RAM and connecting to DB via VPN

It takes nearly 3 minutes to execute this SQL

SELECT h.code as property , apa.ExecuteDate ,
 sum(case when apa.IsBreakfast = 1 then apa.QuantityPerStayingDate end) as breakfast 
 FROM vp_team6.prod_ops_transaction_account.AutomationPostingActivity apa 
 left join 
 vp_team2.prod_pms_property.pro_hotel h 
 on 
 apa.PropertyId = h.id 

where true 
and apa.PackageType = 'INCLUDE' 
and apa.State in (1,5) 
and apa.ExecuteDate BETWEEN CURRENT_DATE and DATE_ADD('day',60, CURRENT_DATE) 
and h.code = 'VOPQ' 
group by 1,2;

And the result is

enter image description here

Written with StackEdit.

Gia Dinh Cun Ga

Search This Blog