Code Smells

Code smells are indicators of potential flaws in a program's design and structure. They do not necessarily prevent the code from functioning, but they signal scope for improvement in the implementation.

Code smells are prominent in legacy systems, where developers rarely get the chance to do major refactoring. However, all of these code smells can be detected and resolved during development.

Most code smells can be identified by questioning the written code in the following ways:
  • Is there any duplication?
  • Is it difficult to understand?
  • Is it complicated to explain?
  • Is it difficult to test?
  • Will it be difficult to maintain?

If your answer is 'Yes' to any of these questions, your code most probably contains one or more code smells.

Let's understand this with the following code snippet:
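
(The snippet below is a hypothetical reconstruction, with illustrative names, consistent with the issues listed afterward.)

class Result {

    // Expects the marks of five students in a fixed sequence; subject is never used
    public Double printResults(Double marks1, Double marks2, Double marks3,
                               Double marks4, Double marks5, String subject) {
        // add the marks of all the students
        Double total = marks1 + marks2 + marks3 + marks4 + marks5;
        Double average = total / 5;
        if (average >= 50) {
            System.out.println("Result: PASS, average: " + average);
        } else {
            System.out.println("Result: FAIL, average: " + average);
        }
        return average; // the returned value is never used by any caller
    }

    public void printAverage(Double marks1, Double marks2, Double marks3,
                             Double marks4, Double marks5) {
        // checks the students' marks again, duplicating the logic above
        Double average = (marks1 + marks2 + marks3 + marks4 + marks5) / 5;
        System.out.println(average >= 50 ? "PASS" : "FAIL");
    }
}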

If we check the above class closely, we can find the following issues:

  • Six arguments make the implementation complex and increase the responsibility of the method.
  • The method expects the caller to send the students' marks in sequence, which opens a window for bugs.
  • For a new student, the method needs to be changed.
  • The return statement is not required.
  • The method does not raise an exception for invalid input.
  • The value 50 is a magic number; it is difficult to understand what it represents.
  • The print method checks the students' marks again.
  • The subject parameter is never used.
  • ... and so on.
Many of these issues are identified nowadays by smart IDEs. Out of these issues, a few are actually code smells.

Ask the aforementioned questions again for this code, and you will probably get 'Yes' for all the questions related to the complexity, maintainability, and readability of the code.

The above class contains the following code smells:
  • Long Parameter List
  • Duplication
  • Unnecessary Comments
  • Unnecessary Boxing/Unboxing (double and Double)
  • Unnecessarily Large Method
  • Dead Code 

By questioning your code, you can identify such code smells and fix them for the sake of clean coding practice. There are many more code smells that can occur while writing code.
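
For instance, several of these smells disappear if the method accepts a collection of marks and the threshold becomes a named constant. A hypothetical refactoring sketch of the class above:

import java.util.List;

class Result {

    private static final double PASSING_AVERAGE = 50.0; // named constant replaces the magic number

    // A single collection parameter replaces the long parameter list;
    // adding a student no longer requires a signature change.
    public void printResults(List<Double> marks) {
        if (marks == null || marks.isEmpty()) {
            throw new IllegalArgumentException("marks must not be empty");
        }
        double average = marks.stream()
                              .mapToDouble(Double::doubleValue)
                              .average()
                              .orElseThrow();
        System.out.println(average >= PASSING_AVERAGE ? "PASS" : "FAIL");
    }
}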

Virtual Threads : Java 21

Java 21 came with an exciting feature: virtual threads, i.e., lightweight threads. The idea is to achieve optimal hardware utilisation, which has been a bottleneck for the conventional java.lang.Thread, the platform thread.

Platform threads are mapped 1:1 to OS threads, which makes them powerful enough to accomplish all possible tasks. java.lang.Thread is a thin layer over actual OS threads, which are quite heavy. The number of threads an application can support depends on the hardware capacity of the system.

Let's say a thread's memory consumption is 1 MB; to support 1 million concurrent threads at the time of heavy load, you will need 1 TB of memory. The thread-per-request style of implementation also suffers from this limitation. Asynchronous programming can solve this to some extent, but it has its own drawbacks.

Virtual threads effectively share platform threads. Rather than holding a platform thread for its entire lifetime, a virtual thread mounts on one only for the executions that need it, not while it waits for I/O. This allows an enormous number of concurrent operations without the need for additional threads or hardware capacity, and it brings a new way to deal with concurrency in Java applications.

Virtual threads can be created through the java.lang.Thread API; the simplest way is the startVirtualThread static factory:

Runnable function = () -> System.out.println("Something to execute in Virtual Thread!");
Thread virtual = Thread.startVirtualThread(function);
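
The Thread.ofVirtual() builder offers more control, for example naming the thread:

Thread named = Thread.ofVirtual()
        .name("virtual-worker")
        .start(function);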

In addition to that, java.util.concurrent.ExecutorService also has a factory method for virtual threads:

ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor();
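
As a minimal sketch, such an executor can run a huge number of blocking tasks, each on its own virtual thread (the task body here is illustrative):

try (ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 10_000; i++) {
        int taskId = i;
        virtualExecutor.submit(() -> {
            Thread.sleep(1_000); // blocking call; the carrier platform thread is freed meanwhile
            return taskId;
        });
    }
} // close() implicitly waits for the submitted tasks to finish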

Virtual threads are daemon threads; call join() if the main thread needs to wait for them to finish. Thread-local variables of a virtual thread are not accessible to its carrier thread, or vice versa.

Trunk Based Development

The trunk-based branching model seems very inefficient at first glance. Initially, I felt that it does not utilize the branching superpower of a version control system like Git. You see, there are a lot of advantages to using branches, including:

  • …you can develop and test multiple things in parallel in separate branches
  • …you can work on multiple releases at the same time by creating a different branch
  • …different developers can work on separate branches and once done with the implementation can raise pull requests for review

All these benefits sound true when you choose branching over the trunk-based approach, and coming from the same experience, I was skeptical about how this model would help in development. However, I had underestimated it: it offers a lot of positive results in the longer run, and I'll try to list a few of them below.


What is Trunk Based approach?

The trunk-based approach is a practice where all developers push their changes to one “trunk”, the main branch, rather than creating separate branches. That said, this practice doesn't stop developers from creating branches; in certain scenarios you can still create them, but such branches must be short-lived, meaning they are deleted within a few hours or at most within a day.


Why Trunk Based approach?

Continuous integration and early feedback

All changes must be production-deployable. As soon as a developer pushes changes to the repository, the CI/CD pipelines take care of everything, including static code analysis, scans, security checks, etc., and give immediate feedback to the team. The team doesn't need to wait for the final merge and integration before a release.

Fewer conflicts and fewer merges

Smaller, non-breaking changes are pushed frequently throughout the day, so there are fewer conflicts between developers' changes. Each developer syncs with the central repository at least once a day to avoid conflicts as much as possible.

Easy to manage and review

Developers push smaller working changes frequently, and reviewing smaller changes is a better and more efficient option than reviewing one pull request with several files.

Encourages responsible push

Due to continuous integration with all environments, including production (behind feature toggles), developers have to be conscious about the code they push to the central repository.

Yet it's not the best choice in all scenarios; in the following situations, this approach may not help that much:

  • Open Source Projects: You won't expect just anyone to push changes to your trunk directly; going through pull requests is the better choice.
  • Less experienced developers: When the team has more junior members who are still learning, you will need stringent code reviews before they sync their changes to the main branch, to avoid failures due to minor mistakes.

Conclusion

It is important to choose this model based on the nature of your work. For small teams and new projects, this approach can surely help a lot; on the other hand, it may not be as efficient as it seems for legacy systems or teams with less experienced developers.

Extreme Programming

This post is about my basic understanding of Extreme Programming, XP for short. I came to know about XP after joining Thoughtworks. The team I joined upholds the XP practices very proudly and ensures they remain followed as much as possible. Since then, I have been trying to connect the dots between the different aspects of XP, including pair programming, quick feedback, shorter iterations, test-driven development, and so on.

All new joiners at Thoughtworks are given Kent Beck's book, Extreme Programming Explained, so that they can start exploring XP in more detail and be able to relate to it when added to an XP team.


After reading this book, I realized multiple things about XP that I want to share with you. I'm still exploring it, yet I decided to share whatever I understood from this interesting book.

Extreme programming is a guideline for effective collaboration and assured quality. It's a methodology for improving not just as an individual but, mainly, as a team.

Extreme programming is "extreme" because it takes the effectiveness of principles, practices, and values to the extreme. It asks you to do your best and then deal with the consequences; that's extreme. Programmers must get past the attitude of "I know better than everyone else and all I need is to be left alone to be the greatest."

Further in the post, I will use the abbreviation XP for Extreme Programming.

XP is based on the 4 Ps:
  • Philosophy of communication, courage, respect, feedback, and simplicity.
  • Principles for translating values into practices.
  • Practices to express the values correctly.
  • People who share the values and practices.
XP is about
  • Estimating your work accurately
  • Creating a rapid feedback loop
  • Delivering quality the first time
XP says Stay Aware. Adapt. Change.

XP includes three major components: Values, Principles, and Practices. 

We will talk about each in detail.

Values 

Roots of things we like and don't like in any situation. 

XP embraces five values,
  1. Communication
    • Effective communication between people involved in the project whether internal or external. 
    • Choosing the right medium for communication. Creating lengthy unmanageable documents for everything is not effective all the time, quick team meetings and one on one discussions may suffice in many cases.
    • Ensuring inclusivity and respect while communicating for better outcomes.
  2. Simplicity
    • Simplicity needs to be planned while complexity is accidental. 
    • Many engineers fall into the rabbit hole of making things complex simply to show off or pretend to be experts. When it comes to software, keeping things simple rewards the team and organization very well in the long run.
    • Think in a direction that avoids wasted complexity as much as possible.
  3. Feedback
    • Keep sharing and receiving feedback.
    • Feedback about your day-to-day work, involvement in the team, and overall effectiveness as a team member.
    • Giving genuine feedback is equally important, be honest and respectful.
  4. Courage
    • Choose to do the right thing even when you feel afraid.
    • The team should create an environment to ensure everyone feels safe to speak on any aspect of work during meetings.
  5. Respect
    • Everyone on the team is important for the success of the project.
    • I am an important member, and so are you. Always remember that.
    • Ensure respect in difficult situations by focusing on the problem rather than the person.


Principles 

The principles are a bridge between practices and values. 

XP embraces principles like, 
  • Humanity 
    • Create an environment where team members feel safe to speak up.
    • Each member feels included.
    • When it comes to accomplishment, focus on the team's accomplishments rather than individual-level accomplishments.
  • Economics 
    • Software development is even more valuable when it actually earns money.
    • Focus on quality, but keeping an eye on generating value economically will help the team and organization grow.
  • Mutual Benefit 
    • Values and practices must benefit the team now and in the future as well.
    • The benefits of XP should not be limited to the team but should be extended to the customer.
  • Self-Similarity
    • Avoid reinventing the wheel; see whether your existing setup or structure can be replicated for some other use.
    • The more it gets replicated, the more feedback you will get for improving it in the long run.
  • Improvement 
    • Perfection is the enemy of good.
    • XP focuses on delivering important things in smaller iterations. 
    • Rather than delivering everything perfectly after months or years, XP recommends delivering things iteratively for better feedback and less overhead.
  • Diversity
    • Work together as a team on a given problem, ensuring the opinions and thoughts of all team members are taken into account.
    • The team should comprise all the required skill sets.
  • Reflection 
    • After every iteration, reflect and find what went well and what can be improved.
    • Appreciate good work and discuss the areas of improvement.
  • Flow 
    • Create a flow of deploying smaller increments of value frequently with the best quality.
  • Opportunity 
    • Problems are opportunities; defects and failures can be treated as opportunities to find gaps and correct them.
    • Ensure the root cause of the problem is fixed. Fix the defect and make sure it never resurfaces.
  • Redundancy 
    • Do not remove redundancy that serves a valid purpose. 
    • Some of the practices may sound redundant, but keeping them running still helps. For example, pair programming and continuous integration both give early feedback and help avoid silly errors, which doesn't reduce the importance of either practice individually.
  • Failure 
    • When you don't know what to do, risk failure.
    • Say you are stuck at a point where you have to choose one of two possible implementation approaches and can't figure out which one is more suitable. Go ahead and try both.
    • Have the courage to try and fail rather than play safe.
  • Quality 
    • You won't be able to figure out the best possible way every time.
    • Do whatever best you can to deliver quality. 
    • Keep looking for better and embrace change.
  • Baby Steps
    • Shorter cycles are like baby steps, which help to quickly correct and get early feedback.
    • Small steps ensure minimum overhead and faster rollback if needed.
  • Accepted Responsibility
    • XP focuses on team members accepting the required responsibility rather than having it enforced on them.

Practices

Things you objectively do day to day. Practices are situation-dependent: you may need to adopt new practices based on the situation; values, on the other hand, do not have to change the way practices do.

XP introduces the following primary practices:
  • Sit Together 
    • The more face time, the more humane and productive the project.
    • Sitting together embraces belonging and effective collaboration.
  • Whole Team 
    • Include people with all the skills and perspectives necessary.
  • Informative Workspace 
    • Your workspace should reflect your work.
    • Someone walking into your team's space should get clarity about what's going on by looking at cards on the wall.
  • Energized work 
    • Work only as many hours as you can be genuinely productive.
    • Do not work more to complete more; work effectively to accomplish what is necessary for the given day.
    • More work doesn't mean great work.
  • Pair Programming 
    • Collaborate in pairs for effective efforts and better quality.
    • Be open-minded while pairing, listen more. 
    • Listen attentively to the opinions and perspectives of your pair. 
  • Stories 
    • Divide major releases into smaller releases.
    • Divide each release into smaller stories.
    • If required divide stories into smaller tasks.
    • Ensure to pick the most important items after discussing them with the customer.
    • Early estimation of stories helps to plan the iteration well.
  • Weekly cycle 
    • Start with writing automated system tests and work during the week to make them pass.
  • Quarterly cycle 
    • Identify bottlenecks and find a theme (or themes) for the quarter. 
    • Reflect and plan with the big picture in mind.
    • This activity is mostly driven by the project manager.
  • Slack 
    • A few met commitments go a long way toward rebuilding relationships.
    • Deliver what you can with assured quality and value, rather than delivering any number of things with compromised quality and bugs.
  • Ten-Minute Build 
    • The automated build must finish within 10 minutes for rapid feedback.
  • Continuous integration 
    • Continuously integrate your changes with automated builds.
  • Test First Programming (TDD) 
    • Write system-level tests first for better quality and implementation.
    • Your tests should depict the story and requirements. Making them pass should ensure completion.
  • Incremental Design 
    • Invest in the design of the system every day, and keep improving.
    • Design decisions are subjective and depend on the nature of the project as well.
These are the primary practices; there are also certain corollary practices that may or may not be applicable in all cases but still add a lot of value to your software development. We may talk about them in a separate blog post.

An XP iteration may look like the following:
  • Identify and estimate high-priority items to be delivered in this iteration.
  • Convert the requirements into stories.
  • Effectively estimate the stories.
  • Start by writing failing tests that are expected to pass by the end of the iteration.
  • Work in pairs to make the failing tests pass.
  • Continuously integrate and get feedback.
  • Reflect after each iteration.
This is just the beginning of XP; the more your team practices it, the more effective and fruitful it will be in the longer run. 

Remember in XP,

"Make a change, observe the effects; then digest the change, turning it into a solid habit."

Deep vs Shallow Copy

In a programming language like Java, everything revolves around objects. The lifecycle of an object, from creation to garbage collection, is something that keeps happening continuously in any real-world application.

There are different ways to create an object of a class. However, object creation with the new keyword is the most straightforward one.

Car mercedes = new Mercedes();

While we keep creating objects on demand, in some real-world scenarios we may need to create a new object that is a copy of an existing object. In that case, the new object should hold the same state as the current object.

You may ask: what kinds of scenarios require copying an existing object?

Well, it completely depends on the software you are developing, but in any application where you see a copy option, whether it is copying a table row, copying a form, etc., such cases are good candidates for an object-copying mechanism.

There are two approaches you can use to copy an object:

  • Shallow copy
  • Deep copy

There are different methods to implement a copying approach:

  • Using clone method
  • Copy constructor
  • Static factory method

There are also certain third-party libraries available that provide methods to copy objects.

For example, BeanUtils#copyProperties(Object destination, Object source)
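
For illustration, a usage sketch with Apache Commons BeanUtils (this assumes Car follows JavaBean conventions, i.e., a public no-arg constructor plus getters and setters, which the snippets below omit for brevity; copyProperties performs a shallow, reflection-based copy):

import org.apache.commons.beanutils.BeanUtils;

Car source = new Car("GLS", new Company("Mercedes"));
Car target = new Car();
BeanUtils.copyProperties(target, source); // note the (destination, source) parameter order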

Which option to choose depends mainly on your requirements.

Shallow copy

All the fields of the existing object are copied to the new object.

Consider the following diagram: while copying the Car object, the Company object reference is reused in the copy, which simply means a shallow copy only copies values (variables and object references).

Basically, it doesn't create copies of the objects referenced inside the object we want to copy; that is why it is called a shallow copy.

[Diagram: shallow copy, where the copied Car reuses the same Company reference]
The following is a shallow copy example using the copy constructor method:

class Car {

    private String model;

    private Company company;

    public Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Shallow copy: the new Car reuses the existing Company reference
    public Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, carToCopyFrom.company);
    }

}
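
A quick check (assuming a hypothetical getCompany() accessor on Car) shows that both objects now share the same Company instance:

Company company = new Company("Mercedes");
Car original = new Car("GLS", company);
Car copy = new Car(original); // invokes the shallow copy constructor

System.out.println(original.getCompany() == copy.getCompany()); // prints true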

Deep copy

All the fields, along with the objects referenced by the existing object, are copied to the new object.

Consider the following diagram: the Company object is also copied, and a reference to this new object is used in the Car copy. Note that all referenced objects, at any level (direct or indirect), are copied and referenced in the copied object.

[Diagram: deep copy, where the copied Car references its own copy of the Company object]
The following is a deep copy example using the copy constructor method:

class Car {

    private String model;

    private Company company;

    public Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Deep copy: a new Company instance is created for the new Car
    public Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, new Company(carToCopyFrom.company.getName()));
    }
}
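
With the deep copy constructor, the same check (again assuming a hypothetical getCompany() accessor) yields the opposite result, because the copy holds its own Company instance:

Company company = new Company("Mercedes");
Car original = new Car("GLS", company);
Car copy = new Car(original); // invokes the deep copy constructor

System.out.println(original.getCompany() == copy.getCompany()); // prints false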

BigQuery Pricing and Cost Optimization

In today’s world of information, data analytics is an integral part of every business decision. While analytics is such an important factor in business, the cost of analytics tools and technologies is equally important to ensure a high return on investment and minimum waste of resources.

BigQuery is one of the leading data analytics technologies for organizations of all sizes. BigQuery not only helps to analyze data but also helps organizations with real-time decision-making, reporting, and future predictions.

Architecture

BigQuery is a completely serverless enterprise data warehouse. In BigQuery, storage and compute are decoupled so that both can scale independently on demand.

Such an architecture offers flexibility to customers: they don't need to keep compute resources up and running all the time, and they no longer need to worry about system engineering and database operations while using BigQuery.

BigQuery has distributed, replicated storage along with a high-availability cluster for compute. We don't need to provision VM instances to use BigQuery; it automatically allocates computing resources as needed.

  • Table types

    • Standard

      • Structured data and well-defined schema

    • Clones

      • Writable copies of a standard table.

      • Lightweight, as BigQuery stores only the delta between the clone and its base table.

    • Snapshots

      • Point-in-time copies of the table.

      • BigQuery only stores the delta between a table snapshot and its base table.

    • Materialized views

      • Precomputed views periodically cache the results of the view query.

    • External

      • Only the table metadata is kept in BigQuery storage.

      • The table definition points to an external data store, such as Cloud Storage.

  • Dataset

    • BigQuery organizes tables and other resources into logical containers called datasets.

  • Key features

    • Managed

      • You don’t need to provision storage resources or reserve units of storage.

      • Pay for the storage you use.

    • Durable

      • 99.999999999% (11 9's) annual durability

    • Encrypted

      • Data is automatically encrypted before being written to disk.

      • Custom encryption is also possible.

    • Efficient

      • An efficient encoding format, optimized for analytic workloads.

    • Compressed

      • Proprietary columnar compression, automatic data sorting, clustering, and compaction.

Pricing Models

Analysis Pricing

Cost of processing queries (SQL, user-defined functions, DML, DDL, BQML).

| | On-Demand | Standard | Enterprise | Enterprise Plus |
|---|---|---|---|---|
| Model | Pay per byte | Pay per slot-hour | Pay per slot-hour | Pay per slot-hour |
| Price | $5 per TB | $40 per 100 slots per hour | $40 per 100 slots per hour (1- to 3-year commitment discounts) | $40 per 100 slots per hour (1- to 3-year commitment discounts) |
| Notes | 2,000 concurrent slots (shared among all queries of the project) | Can put a cap on spending | Standard + fixed-cost setup | Enterprise + multi-region redundancy and higher compliance |
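
As a rough illustration of these prices: an on-demand query that scans 2 TB costs 2 × $5 = $10, while under an edition a query that keeps 100 slots busy for a full hour costs $40.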

Storage Pricing

Cost to store the data you load.


| | Active (modified in the last 90 days) | Long-term (not modified in the last 90 consecutive days) |
|---|---|---|
| Logical (uncompressed) | Starting at $0.02 per GB | Starting at $0.01 per GB |
| Physical (compressed) | Starting at $0.04 per GB | Starting at $0.02 per GB |
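
For example, 100 GB of active logical storage costs roughly 100 × $0.02 = $2 per month; once that data has gone unmodified for 90 consecutive days, the same 100 GB drops to roughly $1 per month.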

Data Ingestion and Extraction Pricing
For analytics, data needs to be ingested into the data platform, and it may need to be extracted, too.
| Data Ingestion | Data Extraction |
|---|---|
| Batch loading is free when using a shared slot pool. | Batch export of table data to Cloud Storage is free when using a shared slot pool. |
| Streaming inserts are charged per row successfully inserted; individual rows are calculated using 1 KB as the minimum ($0.01 per 200 MB). | Streaming reads use the Storage Read API, starting at $1.10 per TB. |
| BigQuery Storage Write API: $0.025 per GB; the first 2 TB per month are free. | |

Ingestion Pricing

Shared Slot Pool

  • By default, you are not charged for batch loading from Cloud Storage or local files, as it uses shared pools of slots.
  • No guarantee of the availability of the shared pool or the throughput.
  • For large data, the job may wait until slots become available.
  • If the target BigQuery dataset and the Cloud Storage bucket are co-located, network egress while loading is not charged.

Obtain dedicated Capacity

  • If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
  • But this is not free, and you lose access to the free pool as well.

Modes

  • Single Batch Operation.
  • Streaming data one record at a time or in small batches.
| Batch Loading | Storage Write API | Streaming Inserts |
|---|---|---|
| Free using the shared slot pool | $0.025 per GB | $0.01 per 200 MB |
| For guaranteed capacity, choose editions reservations | The first 2 TB per month are free | Charged per row inserted, with 1 KB as the minimum row size |


Extraction Pricing

Shared Slot Pool

  • By default, you are not charged for batch exporting, as it uses a shared pool of slots.
  • No guarantee of the availability of the shared pool or the throughput.

Obtain dedicated Capacity

  • If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
  • But this is not free, and you lose access to the free pool as well.

Storage Read API

  • Charged for the number of bytes read, calculated from the data size, which is based on the size of each column's data type.
  • Charged for any data read in a read session, even if a ReadRows call fails. If a ReadRows call is canceled, you are charged for the data read before the cancellation.
  • On-demand pricing, with 300 TB per month free for each billing account.
  • Exclusions
    • Bytes scanned from temporary tables are free and do not count toward 300 TB.
    • Associated egress cost is not included.
  • Important: To lower the cost,
    1. Use partitioned and clustered tables.
    2. Reduce the data read with a WHERE clause that prunes partitions (see the example below).
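
For example, if a table is partitioned by an event_date column (a hypothetical name), a filter such as WHERE event_date = '2024-01-01' lets BigQuery scan only that partition's bytes instead of the whole table.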

What is free of cost in BigQuery?

  • Cached queries.
  • Batch loading or export of data.
  • Automatic re-clustering.
  • Deleting tables, views, partitions, and datasets.
  • Queries that result in an error.
  • Metadata operations.

*BigQuery free tier offers 10 GB of storage and 1 TB of query processing per month.

Billing Models

Storage

  • Billing models
    • Logical
      • Charged based on logical bytes stored.
      • The default billing model.
    • Physical
      • Charged based on actual bytes stored.
      • If your data compresses well, physical storage can save a good amount of storage and the associated cost.
  • How is data stored?
    • Data is stored in a compressed, columnar format.
    • When you run a query, the query engine distributes the work in parallel across multiple workers.
    • Workers scan the relevant tables in storage, process the query, and then gather the results.
    • BigQuery executes queries completely in memory, using a petabit network to move data extremely fast to the worker nodes.

Compute

  • Billing models:
    1. On-Demand
      • Charged for the number of bytes processed.
      • The first 1 TB each month is free.
    2. Editions Reservation
      • Charged for the number of slot_sec (one slot for one second) used by the query.
      • A slot is a unit of measure for BigQuery compute power.
      • E.g., a query using 100 slots for 10 seconds accrues 1,000 slot_sec.

You can mix and match these models to suit different needs.

[Figure: decision flow for choosing a consumption model]

Custom Quotas

Set a maximum number of bytes that can be processed by a single user on a given billing project.

When the limit is exceeded, the user gets a ‘quota exceeded’ error.

Best Practices

Ingestion and extraction

  • Use the free shared slot pool whenever possible; choose dedicated capacity only in the case of large data.
  • The Avro and Parquet file formats provide better performance on load.
  • Compressed files take longer to load into BigQuery. To optimize load performance, uncompress your data first.
  • In the case of Avro files, however, compressed files load faster than uncompressed ones.
  • Use native connectors to read/write data to BigQuery for better performance, rather than building custom integrations.

Storage Best Practices

  • Use long-term storage (50% cheaper).
  • Use time travel wisely.
    • Recover from mistakes like accidental modification or deletion.
    • Set a default table expiration for transient datasets.
    • Set expirations for partitions and tables.
  • Use snapshots for longer backups.
    • Time travel works for the past 7 days only.
    • Snapshots of a particular time can be stored for as long as you want.
    • They minimise storage cost, as BigQuery stores only the bytes that differ between a snapshot and its base table.
    • Important: There is no initial storage cost for a snapshot, but when data in the base table changes or is deleted and that data also exists in a snapshot, you are charged for the storage of the changed or deleted data.
  • Use clones for modifiable copies of production data.
    • A lightweight copy of the table.
    • Independent of the base table: any changes made to the base will not reflect in the clone.
    • The cost for a table clone is the changed data plus new data.
  • Archive data into a new BigQuery table.
  • Move ‘cold data’ to Google Cloud Storage.

Workload Management

  • Use multiple billing projects.
  • Mix and switch pricing models.
  • Know how many slots you need.
  • Use separate reservations for compute-intensive work.
    • Use baseline slots for critical reservations.
    • Use commitments for sustained usage.
  • Take advantage of slot sharing.
  • Take a dynamic approach to workload management.

Compute cost optimization

  • Follow SQL best practices:
    • Clustering.
    • Partitioning.
    • Select only the columns you need, and curate filtering, ordering, and sharding.
    • Denormalize if needed.
    • Choose the right functions, and pay attention to JavaScript user-defined functions.
    • Choose the right data types.
    • Optimize joins and common table expressions.
    • Look for anti-patterns.
  • Use BI Engine to reduce compute cost.
  • Use materialized views to reduce costs.

Keep a close watch on the cost

  • Budget alerts
  • BigQuery reports
  • BigQuery admin resource charts
  • Looker Studio dashboards

Conclusion

BigQuery offers the flexibility to choose the most suitable options for your storage and compute requirements. Be conscious about opting for the right match, so that you don't end up either starved for resources or wasting them.

References:
  • https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
  • https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
