
Lyve Cloud Documentation

Using Analytics services

Using a self-service model, you can view and manage application services from the analytics dashboard. You can also install additional software on the platform to support real-time data analytics and visualization tools.

The application services include default and user-defined services. Listed below are a few applications that can help with your data science workflow, from data collection to production and more. The platform lets you automate the deployment, scaling, and management of application services. For more information on services, see Working with Services.

You can use the following services as default services or as part of the Lyve Cloud integration. You can use either for data analytics:

Lyve Cloud analytics platform K8s node labels automate the provisioning and lifecycle management of nodes for Kubernetes clusters. You can provision optimized groups of nodes with labels for these clusters.

When you configure the Nuclio and MLRun services in the Custom parameters tab, you can enter a node selector label that identifies the servers on which the pod can run.

Procedure. To set the labels on App Nodes:
  1. On the Clusters page, select the App Nodes tab and Edit Label for one or more nodes.

    1.png
  2. Select + and type in a key: value pair, and select Apply.

    You can add more key: value pairs.

    2.png

    The label name (or names) is displayed in the node row.

The analytics platform provides Trino as a service for data analysis. Trino is a distributed query engine that accesses data stored on object storage through ANSI SQL. This is also used for interactive query and analysis.

Trino is integrated with enterprise authentication and authorization automation to ensure seamless access provisioning with access ownership at the dataset level residing with the business unit owning the data. With Trino resource management and tuning, we ensure 95% of the queries are completed in less than 10 seconds to allow interactive UI and dashboard fetching data directly from Trino. This avoids the data duplication that can happen when creating multi-purpose data cubes.

DBeaver is a universal database administration tool to manage relational and NoSQL databases. Users can connect to Trino from DBeaver to perform the SQL operations on the Trino tables.

Using_Trino.png
Procedure. To configure Trino with Lyve Cloud:
  1. On the left-hand menu of the Platform Dashboard, select Services and then select New Services.

    1.png
  2. In the Create a new service dialogue, complete the following:

    • Basic Settings: Configure your service by entering the following details:

      • Service type: Select Trino from the list.

      • Service name: Enter a unique service name. This name is listed on the Services page.

      • Description: Enter the description of the service.

      • Enabled: The check box is selected by default. Selecting the option allows you to configure the Common and Custom parameters for the service.

      2.png
    • Common Parameters: Configure the memory and CPU resources for the service. The platform uses the default system values if you do not enter any values.

      Note

      When setting the resource limits, consider that an insufficient limit might cause queries to fail.

      • Memory: Provide a minimum and maximum memory based on requirements by analyzing the cluster size, resources and available memory on nodes. Trino uses memory only within the specified limit.

      • CPU: Provide a minimum and maximum number of CPUs based on the requirement by analyzing cluster size, resources and availability on nodes. Trino uses CPU only within the specified limit.

      • Priority Class: By default, the priority is selected as Medium. You can change it to High or Low.

      • Running User: Specifies the logged-in user ID.

      • Shared: Select the checkbox to share the service with other users.

      Common_Parameters.png
    • Custom Parameters: Configure the additional custom parameters for the Trino service.

      • Replicas: Configure the number of replicas or workers for the Trino service.

      • Enable Hive: Select the check box to enable Hive. Once enabled, you must enter the following:

        • Username: Enter the username of the platform (Lyve Cloud Compute) user creating and accessing Hive Metastore.

        • Container: Select big data from the list. This is the name of the container which contains Hive Metastore.

        • Hive Metastore path: Specify the relative path to the Hive Metastore in the configured container.

        Common_Parameters-without_advance.png
  3. Select Create Service.

Procedure. To configure Lyve Cloud S3 properties for the Hive connector:
  1. On the Services page, select the Trino services to edit.

  2. Select ellipses and then select Edit.

    Edit_Trino_service.png
  3. On the Edit service dialog, select the Custom Parameters tab.

  4. Expand Advanced, in the Predefined section, and select the pencil icon to edit Hive.

    Edit_Hive_-_Copy__2_.png
  5. Specify the following in the properties file:

    • hive.s3.aws-access-key: The Lyve Cloud S3 access key is a private key used to authenticate when connecting to a bucket created in Lyve Cloud. The access key is displayed when you create a new service account in Lyve Cloud. A service account contains bucket credentials for Lyve Cloud to access a bucket. For more information, see Creating a service account.

    • hive.s3.aws-secret-key: The Lyve Cloud S3 secret key is the private key password used to authenticate when connecting to a bucket created in Lyve Cloud. The secret key is displayed when you create a new service account in Lyve Cloud. For more information, see Creating a service account.

    • hive.s3.endpoint: The Lyve Cloud S3 endpoint of the bucket, used to connect to a bucket created in Lyve Cloud. For more information, see the S3 API endpoints.

    • hive.s3.ssl.enabled: Use HTTPS to communicate with the Lyve Cloud API. By default, this is set to true.

    • hive.s3.path-style-access: Use path-style access for all requests to access buckets created in Lyve Cloud. This is for S3-compatible storage that doesn't support virtual-hosted-style access. By default, this is set to false.
For more information about other properties, see S3 configuration properties.
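For illustration, the Hive properties might look like the following sketch; all values are placeholders that must be replaced with your own service account credentials and your bucket's Lyve Cloud S3 endpoint.

    hive.s3.aws-access-key=<access-key-from-service-account>
    hive.s3.aws-secret-key=<secret-key-from-service-account>
    hive.s3.endpoint=<your-lyve-cloud-s3-endpoint>
    hive.s3.ssl.enabled=true
    hive.s3.path-style-access=true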

Procedure. To configure advanced settings for Trino service:
  1. On the Services page, select the Trino services to edit.

  2. Select ellipses and then select Edit.

    Edit_Trino_service.png
  3. On the Edit service dialog, select the Custom Parameters tab.

  4. Expand Advanced to edit the Configuration File for Coordinator and Worker.

    edit_service-advance.png
  5. You can edit the properties file for Coordinators and Workers. Select the Coordinator and Worker tab, and select the pencil icon to edit the predefined properties file.

    The following are the predefined properties files:

    • Log Properties: You can set the log level. For more information, see Log Levels.

    • JVM Config: It contains the command line options to launch the Java Virtual Machine. For more information, see JVM Config.

    • Config Properties: You can edit the advanced configuration for the Trino server. For more information, see Config properties.

    • Catalog Properties: You can edit the catalog configuration for connectors, which are available in the catalog properties file. For more information, see Catalog Properties.

    config_files.png

Assign a label to a node and configure Trino to use nodes with the same label, so that Trino runs SQL queries on the intended nodes of the cluster. Node labels are provided during the Trino service configuration, and you can edit these labels later.

Procedure. To assign node label to Trino:
  1. On the Services menu, select the Trino service and select Edit.

  2. In the Edit service dialogue, verify the Basic Settings and Common Parameters and select Next Step.

  3. In the Node Selection section under Custom Parameters, select Create a new entry.

    Trino_label-1.png
  4. Specify the Key and Value of nodes, and select Save Service.

    Trino_label-2.png

Note

The values in the image are for reference.

Ensure the prerequisites are met before you connect Trino with DBeaver.

Procedure. To connect Trino in DBeaver
  1. Start DBeaver.

  2. In the Database Navigator panel, select New Database Connection.

  3. In the Connect to a database dialog, select All and type Trino in the search field.

  4. Select the Trino logo and select Next.

    Trino.png
  5. Select the Main tab and enter the following details:

    • Host: Enter the hostname or IP address of your Trino cluster coordinator.

    • Port: Enter the port number where the Trino server listens for a connection.

    • Database/Schema: Enter the database/schema name to connect.

    • Username: Enter the username for the Lyve Cloud Analytics by Iguazio console.

    • Password: Enter the valid password to authenticate the connection to Lyve Cloud Analytics by Iguazio.

    Connect_to_DB-Main_tab.png
  6. Select Driver properties and add the following properties:

    • SSL: Set SSL to True.

    • SSL Verification: Set SSL verification to None.

    Driver_properties.png
  7. Select Test Connection.

    Sucess_mess_-_Trino_withDBeaver.png

    If the JDBC driver is not already installed, it opens the Download driver files dialog showing the latest available JDBC driver. You must select and download the driver.

    Download_driver.png
  8. Select Finish once the testing is completed successfully.
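If you prefer to query Trino programmatically using the same connection settings, the following is a minimal sketch based on the open-source Trino Python client (pip install trino); the host, port, catalog, schema, and credentials are placeholders, and verify=False mirrors the SSL Verification setting of None used above.

    from trino.dbapi import connect
    from trino.auth import BasicAuthentication

    # Placeholder connection details; reuse the values entered in DBeaver.
    conn = connect(
        host="<trino-coordinator-host>",
        port=443,
        user="<username>",
        catalog="hive",
        schema="default",
        http_scheme="https",
        auth=BasicAuthentication("<username>", "<password>"),
        verify=False,  # equivalent to setting SSL Verification to None
    )

    cur = conn.cursor()
    cur.execute("SHOW CATALOGS")
    print(cur.fetchall())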

When you create a new Trino cluster, it can be challenging to predict the number of worker nodes needed in the future. The number of worker nodes should ideally be sized to both ensure efficient performance and avoid excess costs. Scaling can help achieve this balance by adjusting the number of worker nodes as these loads change over time.

Note

The Lyve Cloud analytics platform supports static scaling, meaning the number of worker nodes is held constant while the cluster is used.

Procedure. To scale Trino services
  1. On the left-hand menu of the Platform Dashboard, select Services.

    Scaling_Trino-1.png
  2. Select the ellipses against the Trino services and select Edit.

    Edit_Trino_service.png
  3. Skip Basic Settings and Common Parameters and proceed to configure Custom Parameters.

    • In the Custom Parameters section, enter the Replicas and select Save Service.

      Scaling_Trino-4.png

      Trino scaling is complete once you save the changes.

After you install Trino, the default configuration has no security features enabled. You can enable security features for different aspects of your Trino cluster.

Authentication in Trino

You can configure a preferred authentication provider, such as LDAP.

You can secure Trino access by integrating with LDAP. After completing the integration, you can establish the Trino coordinator UI and JDBC connectivity by providing LDAP user credentials.

To enable LDAP authentication for Trino, LDAP-related configuration changes need to be made on the Trino coordinator.

  1. On the left-hand menu of the Platform Dashboard, select Services.

    1-Se.png
  2. Select the ellipses against the Trino services and select Edit.

    2-se.png
  3. Skip Basic Settings and Common Parameters and proceed to configure Custom Parameters.

    • In the Advanced section, add the ldap.properties file for the Coordinator in the Custom section.

      3-se.png

      Configure the password authentication to use LDAP in ldap.properties as below.

      password-authenticator.name=ldap
      ldap.url=ldaps://<ldap-server>:636
      ldap.user-bind-pattern=<Refer below for usage>
      ldap.user-base-dn=OU=Sites,DC=ad,DC=com

    • ldap.url: The URL of the LDAP server. The URL scheme must be ldap:// or ldaps://. Connecting to the LDAP server without TLS enabled requires ldap.allow-insecure=true.

    • ldap.user-bind-pattern: Specifies the LDAP user bind string for password authentication. This property must contain the pattern ${USER}, which is replaced by the actual username during password authentication. The property can contain multiple patterns separated by a colon; each pattern is checked in order until a login succeeds or all logins fail. For example: ${USER}@corp.example.com:${USER}@corp.example.co.uk

    • ldap.user-base-dn: The base LDAP distinguished name for the user trying to connect to the server. For example: OU=America,DC=corp,DC=example,DC=com

Authorization based on LDAP group membership

You can restrict the set of users allowed to connect to the Trino coordinator based on their LDAP group membership by setting the optional ldap.group-auth-pattern property, in addition to the basic LDAP authentication properties. Add the following property to the ldap.properties file.

• ldap.group-auth-pattern: Specifies the LDAP query for LDAP group membership authorization. This query is executed against the LDAP server and, if successful, a user distinguished name is extracted from the query result. Trino validates the user password by creating an LDAP context with the user distinguished name and user password.

For more information about authorization properties, see Authorization based on LDAP group membership.
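As an illustration only, following the generic example in the Trino documentation rather than any platform-specific values, a group-membership pattern might look like this:

    ldap.group-auth-pattern=(&(objectClass=person)(sAMAccountName=${USER})(memberof=CN=AuthorizedGroup,OU=Asia,DC=corp,DC=example,DC=com))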

Add the ldap.properties file details in the config.properties file of the Coordinator using the password-authenticator.config-files=/presto/etc/ldap.properties property:
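The following is a minimal sketch of the resulting coordinator config.properties additions, assuming the file path used above; note that, per the Trino documentation, PASSWORD authentication typically also needs to be enabled on the coordinator, although the platform may already manage this setting for you.

    http-server.authentication.type=PASSWORD
    password-authenticator.config-files=/presto/etc/ldap.properties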

Save changes to complete LDAP integration.

Note

You must configure one step at a time, always apply the changes on the dashboard after each change, and verify the results before you proceed.

Once the Trino service is launched, create a web-based shell service to use Trino from the shell and run queries.

  1. On the left-hand menu of the Platform Dashboard, select Services and then select New Services.

    1.png
  2. In the Create a new service dialogue, complete the following:

    • Basic Settings: Configure your service by entering the following details:

      • Service type: Select Web-based shell from the list.

      • Service name: Enter a unique service name. This name is listed on the Services page.

      • Description: Enter the description of the service.

      • Enabled: The check box is selected by default. Selecting the option allows you to configure the Common and Custom parameters for the service.

      2.png
    • Common Parameters: Configure the memory and CPU resources for the service. The platform uses the default system values if you do not enter any values.

      Note

      When setting the resource limits, consider that an insufficient limit might cause queries to fail.

      • Memory: Provide a minimum and maximum memory based on requirements by analyzing the cluster size, resources and available memory on nodes. Web-based shell uses memory only within the specified limit.

      • CPU: Provide a minimum and maximum number of CPUs based on the requirement by analyzing cluster size, resources and availability on nodes. Web-based shell uses CPU only within the specified limit.

      • Priority Class: By default, the priority is selected as Medium. You can change it to High or Low.

      • Running User: Specifies the logged-in user ID.

      • Shared: Select the checkbox to share the service with other users.

      3.png
    • Custom Parameters: Configure the additional custom parameters for the Web-based shell service.

      • Spark: From the drop-down list, assign the Spark service for which you want a web-based shell.

      • Trino: From the drop-down list, assign the Trino service for which you want a web-based shell.

      • Service Account: A Kubernetes service account which determines the permissions for using the kubectl CLI to run commands against the platform's application clusters.

      4.png
  3. Select Create Service.

Sample querying to Trino

After you create a web-based shell with the Trino service, start the service, which opens a web-based shell terminal where you can execute shell commands.

  1. Select the web-based shell with the Trino service to launch the web-based shell.

    5.png
  2. The web-based shell launches in a new tab.

    6.png
  3. Enter the Trino command to run the queries and inspect catalog structures.

    7.png
Example 6. Creating a schema

Create a schema with a simple query CREATE SCHEMA hive.test_123. After the schema is created, execute SHOW CREATE SCHEMA hive.test_123 to verify the schema.

trino> create schema hive.test_123;
CREATE SCHEMA

trino> show create schema hive.test_123;
                           Create Schema
------------------------------------------------------------------
 CREATE SCHEMA hive.test_123
 AUTHORIZATION USER edpadmin
 WITH (
    location = 'v3io://projects/user/hive/warehouse/test_123.db'
 )


Example 7. Creating a sample table named Employee

Create a sample table named employee using the CREATE TABLE statement.

trino> CREATE TABLE IF NOT EXISTS hive.test_123.employee (eid
varchar, name varchar,
    -> salary varchar, destination varchar);

CREATE TABLE

trino> SHOW CREATE TABLE hive.test_123.employee;
             Create Table

---------------------------------------
CREATE TABLE hive.test_123.employee (
    eid varchar,
    name varchar,
    salary varchar,
    destination varchar
 )
 WITH (
    format = 'ORC'
 )


Example 8. Inserting data into the table

Insert sample data into the employee table with an insert statement.

trino> INSERT INTO hive.test_123.employee VALUES ('1201', 'Mark','45000','Technical manager');
    ->
    -> INSERT INTO hive.test_123.employee VALUES ('1202', 'Paul','45000','Technical writer');
    ->
    -> INSERT INTO hive.test_123.employee VALUES ('1203', 'Allen','40000','Hr Admin');
    ->
    -> INSERT INTO hive.test_123.employee VALUES ('1204', 'John','30000','Op Admin');


Example 9. Viewing data in the table

View data in the table with a SELECT statement.


trino> select * from hive.test_123.employee;

eid      | name     | salary     |  destination
------+-------+--------+-------------------
1203     | Allen    | 40000      | Hr Admin

1201     | Mark     | 45000      | Technical manager

1202     | Paul     | 45000      | Technical writer

1204     | John     | 30000      | Op Admin



The platform uses MLRun to submit Spark jobs and the Spark Operator to run Spark jobs over K8s. When sending a request with MLRun to the Spark Operator, the request contains your full application configuration, including the code and dependencies to run (packaged as a Docker image or specified via URIs), the infrastructure parameters (for example, the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration. Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a K8s abstraction, just a Docker container in this case). The Spark driver then communicates directly with the Kubernetes master to request executor pods, scaling them up and down at runtime according to the load if dynamic allocation is enabled. Kubernetes takes care of the bin-packing of the pods onto Kubernetes nodes (the physical VMs) and dynamically scales the various node pools to meet the requirements.

You can submit Spark jobs on K8s via Jupyter Notebook. For more information, see Spark Operator.

How to submit a Spark job?

You can submit a Spark job using any of the following:

Procedure. To submit a spark job using python:
  1. Install the mlrun package using the following command.

    pip install mlrun
    
  2. Run the following code with Python or in a Jupyter Notebook.

    Run the following command to import the new_function method. This sets up a new Spark function with the Spark Operator; the command argument points to the Spark code file in the file system.

    from mlrun.run import new_function

    Create the Spark function.

    sj = new_function(kind='spark', command='pi.py', name='my-pi-python')

    Set the Spark driver and executor resource requests and limits.

    sj.with_driver_requests(cpu="1",mem="512m")
    sj.with_executor_requests(cpu="1",mem="512m")
    sj.with_driver_limits(cpu="1")
    sj.with_executor_limits(cpu="1")

    Add fuse, daemon, and Iguazio's JAR support.

    sj.with_igz_spark()

    Set the artifact path where the run saves its artifacts.

    sr = sj.run(artifact_path="local:///tmp/")
Procedure. To submit a spark job using Java:
  1. Install the mlrun package using the following command.

    pip install mlrun
    
  2. Run the following code with Python or in a Jupyter Notebook.

    Run the following command to import the new_function method. This sets up a new Spark function with the Spark Operator; the command argument points to the Spark application JAR in the file system.

    from mlrun.run import new_function

    Create the Spark function.

    sj = new_function(kind='spark', command='/User/spark-examples_2.12-3.0.1.jar', name='my-pi-java')

    Set the Spark driver and executor resource requests and limits.

    sj.with_driver_requests(cpu="1",mem="512m")
    sj.with_executor_requests(cpu="1",mem="512m")
    sj.with_driver_limits(cpu="1")
    sj.with_executor_limits(cpu="1")

    Add fuse, daemon, and Iguazio's JAR support, set the job type to Java, and specify the Java class name to execute.

    sj.with_igz_spark()
    sj.spec.job_type = "Java"
    sj.spec.main_class = "org.apache.spark.examples.JavaSparkPi"

    Set the artifact path where the run saves its artifacts.

    sr = sj.run(artifact_path="local:///tmp/")
Procedure. To submit a spark job using scala:
  1. Install the mlrun package using the following command.

    pip install mlrun
    
  2. Run the following code with Python or in a Jupyter Notebook.

    Run the following command to import the new_function method. This sets up a new Spark function with the Spark Operator; the command argument points to the Spark application JAR in the file system.

    from mlrun.run import new_function

    Create the Spark function and set the driver and executor resource requests and limits.

    sj = new_function(kind='spark', command='/User/spark-examples_2.12-3.0.1.jar', name='my-pi-scala')
    sj.with_driver_requests(cpu="1",mem="512m")
    sj.with_executor_requests(cpu="1",mem="512m")
    sj.with_driver_limits(cpu="1")
    sj.with_executor_limits(cpu="1")

    Add fuse, daemon, and Iguazio's JAR support, and set the job type to Scala.

    sj.with_igz_spark()
    sj.spec.job_type = "Scala"

    Add the Scala class name to execute.

    sj.spec.main_class = "org.apache.spark.examples.SparkPi"

    Set the artifact path where the run saves its artifacts.

    sr = sj.run(artifact_path="local:///tmp/")

Assign a driver or executor label to a node, and reference the same label when submitting jobs; the job then uses the driver and executor nodes from the labeled set. For more information about node labels, see Setting Labels on App Nodes.

Procedure. To assign a node label to the spark job:
  • Submitting job via Jupyter notebook or Python.

    Provide the driver node label and executor node label when you submit a job using a Jupyter notebook. The following code snippet sets the driver and executor node selection using the node label key and label values.

    sj.with_driver_node_selection(node_selector={node_label_key:driver_node_label})
    sj.with_executor_node_selection(node_selector={node_label_key:executor_node_label})
    Best practices to optimize Spark jobs

    Optimizing your Spark jobs is critical to processing large workloads, especially when resources are limited. With Spark, you can tune many configurations before resorting to cluster elasticity. Below are recommendations for improving your Spark jobs.

    1. Use the correct file format

      Most Spark jobs run as a pipeline where one Spark job writes data to a file and another Spark job reads the data, processes it, and writes the results to another file for the next stage to pick up. Writing intermediate files in serialized and optimized formats like Avro, Kryo, Parquet, etc., performs better than text, CSV, or JSON formats.

      Spark's default file format is Apache Parquet, which uses column-oriented storage. Data is stored contiguously within the same column, which is well-suited to performing queries (transformations) on a subset of columns and on large dataframes. Only the data associated with the required columns is loaded into memory.

      Trino provides faster SQL with its custom ORC reader implementation. ORC uses a two-step system to decode data. The first step is a traditional compression algorithm like gzip that generically reduces data size. The second step applies data-type-specific compression algorithms that convert the raw bytes into values.

    2. Maximize parallelism in Spark

      Spark is efficient because it can process several tasks in parallel at scale. Creating a clear division of tasks and allowing Spark to read and process data in parallel will optimize performance. Users should split a dataset into several partitions that can be read and processed independently and in parallel.

      Create partitions in the following manner:

      • When reading the data, create partitions using the Spark parameter spark.sql.files.maxPartitionBytes (default value of 128 MB). Data stored in several partitions on disk is the best-case scenario; for example, a dataset in Parquet format with a directory containing data partition files between 100 and 150 MB in size. Increase the number of partitions by lowering the value of the spark.sql.files.maxPartitionBytes parameter. Note: this can lead to issues for files with small sizes, so the parameter should be tuned empirically according to the available resources.

      • Directly in the Spark application code using the Dataframe API:

        df.repartition(100)
        df.coalesce(100)

        repartition() redistributes the data into the given number of partitions, while coalesce() decreases the number of partitions while avoiding a shuffle across the network.

    3. Beware of shuffle operations

      Shuffle partitions are created during the stages of a job involving a shuffle, i.e. when a wide transformation (For example, groupBy(), join()) is performed. The setting of these partitions impacts both the network and the read/write disk resources.

      The number of partitions may be adjusted by changing the value of spark.sql.shuffle.partitions. This parameter should be adjusted according to the size of the data. The default number of partitions is 200, which may be too high for some processes and result in too many partitions being exchanged between executing nodes.

      The following configurations are relevant when tuning shuffles:

      • spark.driver.memory: The default value is 1 GB. This is the memory allocated to the Spark driver to collect data from the executor nodes, for example during a collect() operation.

      • spark.shuffle.file.buffer: The default value is 32 KB. We recommend increasing this to 1 MB, which allows Spark to buffer more data before writing results to disk.
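      As a rough sketch (the values below are hypothetical and should be tuned for your data volume and cluster), both settings can be applied when building the Spark session:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("shuffle-tuning-example")
            .config("spark.sql.shuffle.partitions", "64")   # default is 200
            .config("spark.shuffle.file.buffer", "1m")      # default is 32k
            .getOrCreate()
        )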

    4. Use Broadcast Hash Join

      A join between several dataframes is a common operation. In a distributed context, a large amount of data is exchanged in the network between the executing nodes to perform the join. Depending on the size of the tables, this exchange causes network latency, which slows down processing. Spark offers several join strategies to optimize this operation, such as Broadcast Hash Join (BHJ). This technique is suitable when one of the merged dataframes is sufficiently small to be duplicated in memory on all the executing nodes.

      By duplicating the smallest table, the join no longer requires any significant data exchange in the cluster apart from the broadcast of this table beforehand, which significantly improves the speed of the join. The Spark configuration parameter to modify is spark.sql.autoBroadcastJoinThreshold. The default value is 10 MB, i.e., this method is chosen if one of the two tables is smaller than this size. If sufficient memory is available, it may be very useful to increase this value, or set it to -1 to disable the size-based automatic selection and rely on explicit broadcast hints instead.
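      As an illustration only (the threshold value and dataframes below are hypothetical), the threshold can be adjusted at runtime, or the smaller dataframe can be broadcast explicitly with a hint:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast

        spark = SparkSession.builder.getOrCreate()

        # Raise the automatic broadcast threshold to 100 MB, or set it to -1
        # to disable the size-based automatic selection.
        spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

        large_df = spark.range(1_000_000)                    # column: id
        small_df = spark.createDataFrame(
            [(i, f"label_{i}") for i in range(10)], ["id", "label"]
        )

        # Hint the planner to broadcast the small dimension table to all executors.
        joined = large_df.join(broadcast(small_df), "id")
        joined.show(5)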

    5. Cache intermediate results

      Spark uses lazy evaluation and a DAG to describe a job to optimize computations and manage memory resources. This offers the possibility of quickly recalculating the steps before an action, and thus executing only part of the DAG. To take full advantage of this functionality, it is encouraged to store expensive intermediate results if several operations use them downstream of the DAG. When an action is run, its computation can be based on these intermediate results, replaying only a sub-part of the DAG before that action. Users may decide to cache intermediate results to speed up execution.

      from pyspark.sql.functions import col

      dataframe = spark.range(1 * 1000000).toDF("id").withColumn("square", col("id")**2)
      dataframe.persist()
      dataframe.count()

      The above code took 2.5 seconds to execute.

      dataframe.count()
      

      The subsequent execution took 147 ms, requiring less time because the data is cached. The same data can also be cached using the following.

      from pyspark import StorageLevel

      dataframe.persist(StorageLevel.MEMORY_ONLY)

      The full list of storage level options is available in the Spark documentation.

    6. Manage the memory of the executor nodes

      The memory of a Spark executor is broken down into execution memory, storage memory, and reserved memory.

      Unified (execution + storage) memory = spark.memory.fraction * (spark.executor.memory - reserved memory)

      By default, the spark.memory.fraction parameter is set to 0.6. This means that 60% of the memory (after the default reserved memory of 300 MB is removed) is shared between execution and storage, while the remaining 40% is left for user data structures and internal metadata, and acts as a safeguard against out-of-memory (OOM) errors.

      We can modify the following two properties:

      spark.executor.memory
      spark.memory.fraction
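      As a worked example under assumed settings (spark.executor.memory = 4g, the default spark.memory.fraction = 0.6, and the fixed 300 MB of reserved memory):

        # Hypothetical executor sizing used only to illustrate the formula above.
        executor_memory_mb = 4096   # spark.executor.memory = 4g
        reserved_mb = 300           # fixed reserved memory
        memory_fraction = 0.6       # spark.memory.fraction (default)

        unified_mb = memory_fraction * (executor_memory_mb - reserved_mb)
        print(unified_mb)           # ~2278 MB shared between execution and storage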
    7. Don't use collect(); use take() instead

      When a user calls the collect action, the result is returned to the driver node. This might seem innocuous initially, but if you are working with huge amounts of data, the driver node can easily run out of memory. The take(n) action scans and returns only the first n rows, reading as few partitions as possible. For example, if you just want to get a feel for the data, use take(1) to fetch a single row.

      df = spark.read.csv("/FileStore/tables/train.csv", header=True)
      df.collect()
      df.take(1)
    8. Avoid using UDFs

      UDFs (user-defined functions) are a black box to Spark, so it cannot apply optimizations to them, and you lose the optimizations that Spark applies to Dataframes/Datasets (the Tungsten and Catalyst optimizers). Whenever possible, use Spark SQL built-in functions, as these functions are designed to be optimized.

    9. Use Broadcast and Accumulator variables

      Accumulators and broadcast variables are both Spark shared variables. An accumulator is a write/update shared variable, whereas a broadcast variable is a read-only shared variable. In a distributed computing engine like Spark, it is necessary to know the scope and life cycle of variables and methods while executing, because data is divided and executed in parallel on different machines in the cluster.

      • Broadcast variable: A read-only variable that is cached on all the executors to avoid shuffling data between executors. Broadcast variables are used as lookups without any shuffle, as each executor keeps a local copy.

      • Accumulator variable: Accumulators are also known as counters in MapReduce; an accumulator is an update variable, used when you want Spark workers to update some value.
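      The following minimal PySpark sketch (the lookup values and IDs are made up for illustration) shows a broadcast variable used as a lookup and an accumulator used as a counter:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext

        # Broadcast variable: a read-only lookup cached on every executor.
        lookup = sc.broadcast({"1201": "Technical manager", "1202": "Technical writer"})

        # Accumulator: a counter that workers update and the driver reads.
        missing = sc.accumulator(0)

        def resolve(eid):
            if eid not in lookup.value:
                missing.add(1)
                return "Unknown"
            return lookup.value[eid]

        rdd = sc.parallelize(["1201", "1202", "9999"])
        print(rdd.map(resolve).collect())   # ['Technical manager', 'Technical writer', 'Unknown']
        print(missing.value)                # 1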

    10. Partition and Bucketing columns

      Shuffling is the main factor that affects Spark job performance. To reduce data shuffling, one of the more important techniques is to use bucketing and repartitioning of data instead of relying on broadcasting and caching/persisting dataframes.

      Partitioning and bucketing are the two optimization techniques used for Hive tables. Both are used to organize large filesystem data that is leveraged in subsequent queries to efficiently distribute data into partitions. partitionBy() groups data of the same type into partitions, each stored in its own directory. Bucketing is a similar kind of grouping, but it is based on a hashing technique, and each bucket is stored as a file.
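      A minimal sketch of writing a partitioned and bucketed table follows (the dataframe, bucket count, and table name are hypothetical; bucketBy() requires saveAsTable()):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        df = spark.createDataFrame(
            [("1201", "Mark", "45000", "Technical manager"),
             ("1202", "Paul", "45000", "Technical writer")],
            ["eid", "name", "salary", "destination"],
        )

        # partitionBy writes one directory per destination value; bucketBy hashes
        # eid into a fixed number of files within each partition.
        (df.write
           .partitionBy("destination")
           .bucketBy(8, "eid")
           .sortBy("eid")
           .mode("overwrite")
           .saveAsTable("employee_bucketed"))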

    11. Garbage Collection Tuning

      JVM garbage collection can be a problem when you have a large collection of unused objects. The first step in GC tuning is to collect statistics by enabling verbose GC logging when submitting Spark jobs. In an ideal situation, we try to keep GC overhead below 10% of heap memory.
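      A sketch of enabling verbose GC statistics when building the Spark session (the application name is hypothetical, and these flags assume Java 8-style GC logging):

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("gc-logging-example")
            .config("spark.executor.extraJavaOptions",
                    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
            .getOrCreate()
        )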

    Nuclio is an open-source, managed serverless platform used to minimize development and maintenance overhead and automate the deployment of data-science-based applications. The framework focuses on data-, I/O-, and compute-intensive workloads. It is well integrated with popular data science tools, such as Jupyter and Kubeflow, supports a variety of data and streaming sources, and supports execution over CPUs and GPUs.

    You can use Nuclio through a fully managed application service on the Lyve Cloud analytics platform. MLRun serving utilizes serverless Nuclio functions to create multi-stage real-time pipelines.

    Nuclio addresses the desired capabilities of a serverless framework:

    • Real-time processing with minimal CPU/GPU and I/O overhead and maximum parallelism

    • Native integration with a large variety of data sources, triggers, processing models, and ML frameworks

    • Stateful functions with data-path acceleration

    The following sample code snippet configures a RabbitMQ trigger from MLRun. The trigger specification uses these parameters:
    1. exchangeName: The exchange that contains the queue.

    2. topics and queueName: These are mutually exclusive. The trigger can either use an existing queue specified by queueName, or create its own queue and subscribe it to topics.

    3. topics: If specified, the trigger creates a queue with a unique name and subscribes it to these topics.

    4. url: The host URL and port of the RabbitMQ server where the queue is available.

      import mlrun
      import json
      class Echo:
          def __init__(self, context, name=None, **kw):
              self.context = context
              self.name = name
              self.kw = kw
      
          def do(self,x):
              y = type(x)
              print("Echo:", "done consuming", y)
              return x
      
      function = mlrun.code_to_function("rabbit2",kind="serving", 
                                        image="mlrun/mlrun")
      trigger_spec={"kind":"rabbit-mq",
            "maxWorkers": 1,
            "url": "amqp://athena-spr-kpiv-prod-admin:46f08c5d998e@spr-athena-rabbitmq.stni.seagate.com:5672/athena-spr-kpiv-prod-01",
            "attributes":{
              "exchangeName": "athena.spr.topic.inference",
              "queueName": "athena-spr-sampling-shadow-validate",
              "topic": "#.sampling.#"},
            }
      function.add_trigger("rabbitmq",trigger_spec)
      
      graph = function.set_topology("flow", engine="async") 
      graph.to(class_name="Echo", name="testingRabbit") 
      function.deploy()

    Grafana is an open-source solution for running data analytics, reporting metrics of massive data sets and providing customizable dashboards. The tool helps to visualize metrics, time series data and application analytics.

    The dashboard helps to:

    • Explore the data

    • Track user behaviour

    • Track application behaviour

    • Identify the frequency and type of errors in production or staging environments.

    Grafana with Lyve Cloud analytics platform natively integrates with other services so you can securely add, query, visualize, and analyze your data across multiple accounts and regions.

    Configuring Grafana service
    Procedure. To configure Grafana service:
    1. On the left-hand menu of the Platform Dashboard, select Services and then select New Services.

      1.png
    2. In the Create a new service dialogue, complete the following configuration:

      • Basic Settings: Configure your service by entering the following details:

        • Service type: Select Grafana from the list.

        • Service name: Enter the service name. This name is listed on the Services page.

        • Description: Enter the description of the service.

        • Enabled: The check box is selected by default. Selecting the option allows you to configure the Common and Custom parameters for the service.

        Create_new_service.png
      • Common Parameters: Configure the memory and CPU resources for the service. If you do not enter any values, the platform uses the default system values.

        Note

        When setting the resource limits, consider that an insufficient limit might fail to execute the queries.

        • Memory: Provide a minimum and maximum memory based on requirements by analyzing the cluster size, resources and available memory on nodes. Grafana uses memory only within the specified limit.

        • CPU: Provide a minimum and maximum number of CPUs based on the requirement by analyzing cluster size, resources and availability on nodes. Grafana uses CPU only within the specified limit.

        • Priority Class: By default, the priority is selected as Medium. You can change it to High or Low.

        • Running User: Specifies the logged-in user ID.

        Common_Parameters.png
      • Custom Parameters: Configure the user and the node label key and value.

        • Platform data-access user: Enter the username/First Name/Last Name/Email of the user who has data access to the platform's data containers.

        • Node Selection: Assign a node for the services to run. The Grafana service will run only on the nodes that are defined with the labels. Select Create a new entry.

          Key: Enter the key for the node label.

          Value: Enter the value of the key.

        Note

        If there are conflicting key values, where the same key is assigned for multiple servers but it has different values, the system prompts you to delete the duplicate keys.

    3. Select Save Service.

    Understanding pre-built panels and dashboard

    Grafana makes it easy to construct the right queries and customize the display properties to create your required dashboard. With multiple pre-built dashboards for various data sources, you can instantly start visualizing and analyzing your application data without having to build dashboards from scratch.

    The following image shows a pre-built dashboard visualizing data for a Kubernetes cluster. Grafana provides pre-built dashboards to help you get started quickly.

    Grafana_Dashboard_-example.png
    Viewing Grafana dashboards

    You can view and create a new dashboard for a cluster.

    Procedure. To view the Grafana dashboard.
    1. On the left-hand menu of the Platform Dashboard, select Clusters.

    2. On the Clusters page select the Applications tab, and then select Status Dashboard.

      STatus_dashboard.png

      Selecting Status Dashboard directs you to the Grafana dashboard of the corresponding cluster.

      Cluster-Grafana_dashboard.png
    3. Monitor the cluster resource usage.

      In this case, Kubernetes Resource Usage Analysis is used as an example of a cluster resource usage dashboard. However, you can search for other cluster dashboards, such as Kubernetes Cluster Health, Kubernetes Cluster Status, and so on.

      1. On the Status Dashboard, select Dashboard, and then select Manage.

        STatus_dashboard-Manage.png
      2. On the Dashboard page, select Private folder, and search Kubernetes Resource Usage Analysis.

        Dashboard-search.png
      3. The Kubernetes Resource Usage Analysis displays the Overall usage, Total usage, Usage by Node, Usage by Pod, etc.

        Kubernetes_Resource_Usage_Analysis_-_Grafana.png

        You can expand the resources to view the graphical representation of the usage dashboard.

        Usage Overview
        Usage_-Overview.png
        Pending/ Failed Pods
        Pending-_Failed_Pods.png
        Pod Count
        Pod-count.png
        Usage - Total
        Usage_-_Total.png
        Usage – By Nodes
        Usage-by_nodes.png
        Usage – By Pods
        Usage_by_Pod.png

    Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Its uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more. Refer to Jupyter Notebook Service for details about the Jupyter service and Jupyter flavours on the Lyve Cloud analytics platform.

    Procedure. To create a Jupyter service
    1. Configure Jupyter from the cluster dashboard

    2. On the Services menu, select Add new service.

    3. In the Create a new service dialogue, complete the following:

      In the Basic Settings, enter the following and select Next.

      • Service Type: Select the type Jupyter Notebook from the list.

      • Service name: Enter the service name.

      • Description: Enter the description.

      1.png
    4. In the Common Parameters section, enter the following and select Next:

      • Memory: Set the maximum memory limit.

      • CPU: Set the maximum CPU limit.

      • Priority: Set the priority.

      2.png
    5. In the Custom Parameters section, enter the following:

      • Flavor: Specify the flavor to use for the resources.

      • Spark: Enter the Spark job, if required.

      • Trino: Select Trino from the list.

      • Environment Variable: Enter the Key and Key Value.

      • Node Selection: Enter the node label Key and Value of nodes details.

      3.png