<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Mike Ramos Tech]]></title><description><![CDATA[Thoughts and ideas from a Senior Software Engineer.]]></description><link>https://mikeramos.tech/</link><image><url>https://mikeramos.tech/favicon.png</url><title>Mike Ramos Tech</title><link>https://mikeramos.tech/</link></image><generator>Ghost 5.88</generator><lastBuildDate>Sun, 10 May 2026 18:46:57 GMT</lastBuildDate><atom:link href="https://mikeramos.tech/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Lessons from modernizing 10 services using Generative AI]]></title><description><![CDATA[<p>I joined a new team at IBM a few months back, with the goal of creating MCP servers for many of the internal APIs that support employees. As I began getting up to speed on the project, I realized the APIs could use some love. These APIs are heavily used</p>]]></description><link>https://mikeramos.tech/lessons-from-modernizing-10-services-using-generative-ai/</link><guid isPermaLink="false">6987d67c59e9e334acefa3ca</guid><dc:creator><![CDATA[Michael Ramos]]></dc:creator><pubDate>Mon, 09 Feb 2026 14:00:32 GMT</pubDate><content:encoded><![CDATA[<p>I joined a new team at IBM a few months back, with the goal of creating MCP servers for many of the internal APIs that support employees. As I began getting up to speed on the project, I realized the APIs could use some love. These APIs are heavily used production services, but they lack adequate testing, documentation, and coding standards. I was worried about what would happen if I needed to make changes to the APIs while I created the MCP servers. I decided I needed to first take a pass at improving the APIs, and I thought this would be a great opportunity to use generative AI.</p><p>For this project, I used <a href="https://www.ibm.com/products/bob?ref=mikeramos.tech" rel="noreferrer">IBM Bob</a>. Bob is an AI-powered development partner that helps write and manage code. This is my first experience using AI-powered coding tools, and I&apos;d like to share what I&apos;ve learned. If you&apos;re new to using AI at work, you might find these tips useful in your projects too.</p><h2 id="start-with-learning-about-the-project">Start with learning about the project</h2><p>The most important thing to tackle early is understanding the project you&apos;re working on. Since I was new to the team, I didn&apos;t have any context around the history of the projects, and I didn&apos;t even know what most of these services did. To make matters worse, most of the projects lacked any documentation. Before changing any code, I asked Bob to document the project so I could begin to understand what the service did and how it fit in the overall stack.</p><p>Here&apos;s the real prompt I used during this stage:</p><pre><code class="language-txt">Task 1: Documentation

Requirements

The repository must have a concise, newcomer-friendly README.md that includes:
 &#x2022; What this repository does (1&#x2013;2 paragraphs)
 &#x2022; Technologies used
 &#x2022; High-level architecture (use Mermaid diagrams)
 &#x2022; How to run locally (prerequisites + commands)
 &#x2022; How tests are run
 &#x2022; How this service is deployed (if applicable)

If the README is missing or incomplete:
 &#x2022; Create or rewrite it
 &#x2022; Keep it concise and skimmable
 &#x2022; Assume the reader has never seen this repo before
</code></pre>
<p>This resulted in documentation that I could easily consume to get an overall understanding of the project. This understanding helps guide the follow-up steps.</p><h2 id="plan-mode-is-your-friend">Plan mode is your friend</h2><p>Plan mode is available on many AI tools now, and I found it to be incredibly useful for this project. By using plan mode, you give the AI the opportunity to think through all changes before jumping into coding. This also helped me think through exactly what I wanted to accomplish. In practice, using plan mode means the AI is less likely to get confused or misinterpret your prompt.</p><p>During this project, I followed this workflow:</p><ol><li>Craft what I thought was a great prompt and ask the AI to create a <strong>phased</strong> plan for implementation. Write the plan to markdown.</li><li>In a loop, ask the AI to implement the phases and update the plan with status. For each implementation phase, I started a new task to clear context.</li><li>Commit the changes in between tasks to save state.</li></ol><p>If the plan needed adjusting, there was an opportunity to do so between tasks. However, if things got too far off track, I found it easier to scrap the implementation in progress and go back to planning from the beginning.</p><h2 id="dont-overwhelm-your-reviewers">Don&apos;t overwhelm your reviewers</h2><p>The same best practices we used before AI-generated code are still important. For this project, that meant splitting my changes into multiple pull requests so I could keep the size of the changes small. I decided to organize the changes by level of risk: documentation-only changes first, testing improvements next, and code changes last. This makes for two pull requests that should be very easy to review, and a third that arrives well documented and tested to aid reviewers.</p><p>At IBM, we don&apos;t have any AI tools specifically created for reviewing pull requests yet, but such a tool would help. I&apos;ve experimented with using the same tool that writes code to review it, but there is high risk with this approach: using the same model could mean making the same mistakes. If available, a tool for reviewing AI-generated code could ease the pull request bottleneck.</p><h2 id="dont-boil-the-ocean">Don&apos;t boil the ocean</h2><p>I set goals at the beginning of the project to improve documentation and testing, and I had to restrain myself from rewriting entire services. AI tools make it easy to make large changes quickly, which feels powerful in the moment. However, in my case, these are production services and any changes require careful reviews. Even though it is easy to make sweeping changes, I&apos;m not an expert in these projects, and reviews would be lengthy. Instead, set targeted goals and stick to them.</p><p>In the end, using generative AI didn&#x2019;t change the fundamentals of good engineering. Clear goals, thoughtful planning, small and reviewable changes, and a solid understanding of the codebase still mattered just as much as before. Tools like Bob helped me move faster and build confidence as I improved unfamiliar services, but the real value came from using AI deliberately and not letting it run unchecked.
If you&#x2019;re new to AI-powered coding tools, start small, stay disciplined, and treat them as a partner.</p>]]></content:encoded></item><item><title><![CDATA[IBM Tech 2025 - My Thoughts]]></title><description><![CDATA[<p>I just got back from Tech 2025, IBM&apos;s premier recognition, education, and development program for top technical IBMers. This year&apos;s event was in San Diego and I had a great time. The program is a mix of fun events and educational sessions, and in this post</p>]]></description><link>https://mikeramos.tech/ibm-tech-2025-my-thoughts/</link><guid isPermaLink="false">6858240659e9e334acefa363</guid><dc:creator><![CDATA[Michael Ramos]]></dc:creator><pubDate>Sun, 22 Jun 2025 17:06:51 GMT</pubDate><content:encoded><![CDATA[<p>I just got back from Tech 2025, IBM&apos;s premier recognition, education, and development program for top technical IBMers. This year&apos;s event was in San Diego and I had a great time. The program is a mix of fun events and educational sessions, and in this post I&apos;ll cover some of my takeaways from the technical sessions.</p><figure class="kg-card kg-image-card"><img src="https://mikeramos.tech/content/images/2025/06/IMG_0505.jpg" class="kg-image" alt="Six IBM Tech attendees posing near event signage" loading="lazy" width="2000" height="1500" srcset="https://mikeramos.tech/content/images/size/w600/2025/06/IMG_0505.jpg 600w, https://mikeramos.tech/content/images/size/w1000/2025/06/IMG_0505.jpg 1000w, https://mikeramos.tech/content/images/size/w1600/2025/06/IMG_0505.jpg 1600w, https://mikeramos.tech/content/images/size/w2400/2025/06/IMG_0505.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><h2 id="leadership-behaviors">Leadership Behaviors</h2><p><a href="https://newsroom.ibm.com/Dinesh-Nirmal?ref=mikeramos.tech" rel="noreferrer">Dinesh Nirmal</a> kicked off the event with an opening session focused on key behaviors of a technology leader. I found this session insightful, especially as I compare it with the behaviors described in <a href="https://www.andrewmcafee.org/the-geek-way?ref=mikeramos.tech" rel="noreferrer">The Geek Way</a>, which is constantly referenced by our senior leadership.</p><p>Here are the six key behaviors Dinesh highlighted:</p><ul><li>Stay Curious</li><li>Move Toward the Unknown</li><li>Collaborate at Speed</li><li>Simplify Relentlessly</li><li>Lead With Data</li><li>Persist With Purpose</li></ul><figure class="kg-card kg-image-card"><img src="https://mikeramos.tech/content/images/2025/06/IMG_5349.jpeg" class="kg-image" alt="Dinesh presenting in front of a slide showing the six leadership behaviors" loading="lazy" width="2000" height="1500" srcset="https://mikeramos.tech/content/images/size/w600/2025/06/IMG_5349.jpeg 600w, https://mikeramos.tech/content/images/size/w1000/2025/06/IMG_5349.jpeg 1000w, https://mikeramos.tech/content/images/size/w1600/2025/06/IMG_5349.jpeg 1600w, https://mikeramos.tech/content/images/size/w2400/2025/06/IMG_5349.jpeg 2400w" sizes="(min-width: 720px) 720px"></figure><p>Of these, Lead With Data stands out the most. There&apos;s strong overlap here with the science norm discussed in The Geek Way.
Dinesh points out that we should be gathering data to make better decisions, but we can go further too. It&apos;s not just what the data says; it&apos;s what we do with it.</p><h2 id="ai-assistants-are-the-pasthow-are-agents-shaping-the-future">&quot;AI Assistants&quot; Are the Past - How Are &quot;Agents&quot; Shaping the Future?</h2><figure class="kg-card kg-image-card"><img src="https://mikeramos.tech/content/images/2025/06/IMG_5373.jpeg" class="kg-image" alt="A slide showing the evolution of assistance, from traditional to single agent to multi-agent" loading="lazy" width="2000" height="1500" srcset="https://mikeramos.tech/content/images/size/w600/2025/06/IMG_5373.jpeg 600w, https://mikeramos.tech/content/images/size/w1000/2025/06/IMG_5373.jpeg 1000w, https://mikeramos.tech/content/images/size/w1600/2025/06/IMG_5373.jpeg 1600w, https://mikeramos.tech/content/images/size/w2400/2025/06/IMG_5373.jpeg 2400w" sizes="(min-width: 720px) 720px"></figure><p>This was the session I found most exciting. The talk focused on new capabilities recently added to <a href="https://www.ibm.com/products/watsonx-orchestrate?ref=mikeramos.tech" rel="noreferrer">watsonx Orchestrate</a>, particularly the agentic AI features. These agents are powered by the think-act-observe loop and can interact with other systems, making them much more dynamic than a chatbot. In addition, the team has released the Agent Developer Kit (<a href="https://developer.watson-orchestrate.ibm.com/?ref=mikeramos.tech" rel="noreferrer">ADK</a>), which can be used for building and deploying tools and agents to Orchestrate. There&apos;s great potential now for a CI/CD pipeline to build, test, and deploy agents to Orchestrate with the ADK.</p><p>Many of the other sessions were also focused on AI and ways to improve our workflows. I&apos;m looking forward to experimenting with the ADK and seeing where agent-based workflows could add value for my team.</p>]]></content:encoded></item><item><title><![CDATA[Leveling Up Your CI/CD Pipeline with Security Best Practices]]></title><description><![CDATA[<p><br>In my previous post, I walked through the foundational steps of a solid CI/CD pipeline including the essential tasks to build and deploy code. Getting your code to production quickly is good, but a great pipeline also makes sure that what you&#x2019;re shipping is&#xA0;secure, verified,</p>]]></description><link>https://mikeramos.tech/leveling-up-your-ci-cd-pipeline-with-security-best-practices/</link><guid isPermaLink="false">6831f32d59e9e334acefa2b6</guid><category><![CDATA[cicd]]></category><category><![CDATA[automation]]></category><category><![CDATA[devops]]></category><dc:creator><![CDATA[Michael Ramos]]></dc:creator><pubDate>Fri, 13 Jun 2025 17:28:08 GMT</pubDate><content:encoded><![CDATA[<p>In my previous post, I walked through the foundational steps of a solid CI/CD pipeline, including the essential tasks to build and deploy code. Getting your code to production quickly is good, but a great pipeline also makes sure that what you&#x2019;re shipping is&#xA0;secure, verified, and compliant.</p><p>By weaving security into each stage of your pipeline, you can catch issues early, reduce risk, and build trust in every release. Here, I&apos;ll share some of what I learned from building secure pipelines for tens of thousands of repositories.</p><h2 id="security-scans">Security Scans</h2><p>Security scanning is one of the easiest and most impactful improvements you can make.
These scans will find issues and vulnerabilities in your code that you can then fix or mitigate. There are many different types of scans, each with its own focus. Here are the most important types to consider.</p><h3 id="software-composition-analysis">Software Composition Analysis</h3><p>Software composition analysis (SCA) focuses on the open source dependencies included in your project. These scanners can identify dependencies and their vulnerabilities, giving insight into areas that might need updates. In addition, these tools can create a software bill of materials (SBOM), which will be very useful for the Security Gates section.</p><p>In my experience, I&apos;ve implemented this type of scan in two ways: directly scanning the application after installing dependencies, and scanning the built container image. The container image gives a more complete scan, including vulnerabilities coming from any base image in use. A scan earlier in the pipeline is more focused, giving high confidence that any findings are a direct result of the application.</p><p><a href="https://trivy.dev/latest/?ref=mikeramos.tech" rel="noreferrer">Trivy</a> is a great open source scanning tool in this area.</p><h3 id="detecting-secrets">Detecting Secrets</h3><p>Sometimes developers accidentally add secrets to their GitHub repository. I use an open source tool called <a href="https://github.com/IBM/detect-secrets-stream?ref=mikeramos.tech" rel="noreferrer">Detect Secrets</a> as part of the pipeline to discover when this has happened. It isn&apos;t ideal &#x2013; if this tool finds anything, the secret has already leaked &#x2013; but it identifies the problem so that it can be resolved immediately.</p><h3 id="static-analysis">Static Analysis</h3><p>Static analysis is a scan of your source code that can recommend improvements as well as find existing vulnerabilities. It should be run before the artifact is built, and can result in a variety of recommendations. The results of this scan can be output in the <a href="https://sarif.info/?ref=mikeramos.tech" rel="noreferrer">SARIF</a> format, a machine-readable output that can be persisted for analysis.</p><h3 id="dynamic-analysis">Dynamic Analysis</h3><p>Dynamic analysis is another scanning option, but this scan runs against your deployed application. It does not have access to the source code, and instead attempts to find vulnerabilities by probing the running application. Therefore, dynamic analysis should run after deployment, and only in non-production environments.</p><p><a href="https://www.zaproxy.org/?ref=mikeramos.tech" rel="noreferrer">Zed Attack Proxy</a> is an open source option for running dynamic security testing.</p><h2 id="signing-build-artifacts">Signing Build Artifacts</h2><p>All build artifacts should be signed with a key that is only available to your CI system. This proves that any artifact with this signature was created by your secure build system and went through all of the expected security testing.</p><p>If you&apos;re creating container images, <a href="https://github.com/sigstore/cosign?ref=mikeramos.tech" rel="noreferrer">cosign</a> is a great open source tool for signing them.</p><h3 id="signature-verification">Signature Verification</h3><p>While not part of the CI/CD pipeline itself, your deployment environments should be configured to validate the signature of everything deployed before it is run. Without this step, signing your images is a nice addition, but it provides no enforced protection.</p>
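<p>To make this concrete, here is a minimal sketch of signing in CI and verifying at deploy time with cosign. The key file names and image reference are hypothetical placeholders; adapt them to your registry and key management setup.</p><pre><code class="language-bash"># One-time setup: create a key pair, and store the private key
# only in your CI system&apos;s secret store
cosign generate-key-pair

# In CI, after the image is built and pushed
cosign sign --key cosign.key registry.example.com/team/app:1.4.2

# At deploy time, verify the signature before running the image
cosign verify --key cosign.pub registry.example.com/team/app:1.4.2
</code></pre>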
<h2 id="audit-readiness">Audit Readiness</h2><p>With all of these new security tasks added to your pipeline, you have a lot of data about the applications being built. This includes all of the dependencies, vulnerability data, versions, and information about the specific changes made. Saving this data can make audits run more smoothly, giving confidence that vulnerabilities are taken seriously and quickly resolved.</p><p>Your SCA results will be among the most valuable. These should include an SBOM, and using a standard format will make further analysis more straightforward. <a href="https://cyclonedx.org/specification/overview/?ref=mikeramos.tech" rel="noreferrer">CycloneDX</a> is a popular format for this requirement.</p><p>With this information saved, you can quickly show your developers which dependencies are responsible for vulnerabilities, even if they are not direct dependencies.</p><h2 id="security-gates">Security Gates</h2><p>The final improvement to your pipelines is to apply policies that prevent risky changes from deploying. For example, if you would like to block all releases with a critical vulnerability, you can do so by running a policy task after the scans are completed. A great open source tool for applying policy checks is the <a href="https://www.openpolicyagent.org/?ref=mikeramos.tech" rel="noreferrer">Open Policy Agent</a>.</p><p>These additions take your pipeline from simple CI/CD to an advanced pipeline ready to securely deploy changes.</p>]]></content:encoded></item><item><title><![CDATA[What Makes a CI/CD Pipeline Great?]]></title><description><![CDATA[<p>Whether you&apos;re shipping a mobile app or deploying a backend service, your Continuous Integration and Continuous Deployment (CI/CD) pipeline is the gatekeeper of quality and security. A good pipeline can catch bugs in minutes, but a great one helps prevent incidents before they reach production. Here, I&</p>]]></description><link>https://mikeramos.tech/what-makes-a-ci-cd-pipeline-great/</link><guid isPermaLink="false">6831e81059e9e334acefa255</guid><category><![CDATA[cicd]]></category><category><![CDATA[automation]]></category><category><![CDATA[devops]]></category><dc:creator><![CDATA[Michael Ramos]]></dc:creator><pubDate>Tue, 27 May 2025 20:32:45 GMT</pubDate><content:encoded><![CDATA[<p>Whether you&apos;re shipping a mobile app or deploying a backend service, your Continuous Integration and Continuous Deployment (CI/CD) pipeline is the gatekeeper of quality and security. A good pipeline can catch bugs in minutes, but a great one helps prevent incidents before they reach production. Here, I&apos;ll review the essentials for a CI/CD pipeline, and in a future post I will highlight where security can help go above and beyond.</p><h2 id="cicd-essentials">CI/CD Essentials</h2><p>First, the basics. CI/CD is all about giving fast feedback to developers, enabling more frequent deployments to production. To facilitate this, a collection of basic CI tasks can be especially helpful.</p><h3 id="dependency-install">Dependency Install</h3><p>As part of CI, you should be installing dependencies (preferably from a lock file) to validate the correct configuration. Because dependencies are often external to the codebase, installing them is usually necessary for the other CI tasks.</p>
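<p>As a quick illustration, a lock-file-based install in a Node.js project looks like the following; other ecosystems have equivalents. This is a generic sketch, not tied to any particular project.</p><pre><code class="language-bash"># Install exactly what package-lock.json specifies,
# failing if the lock file is out of sync with package.json
npm ci

# Poetry equivalent for Python projects: install the pinned
# versions recorded in poetry.lock
poetry install
</code></pre>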
<h3 id="compilation">Compilation</h3><p>If your language is compiled (like Java, Go, or Rust), compilation is one of the most straightforward tests available. Running compilation tasks as part of CI will ensure the code builds cleanly.</p><h3 id="unit-testing">Unit Testing</h3><p>Every CI pipeline needs unit testing to give quick feedback to your developers and detect any regressions in their code changes. Unit tests aren&apos;t perfect &#x2013; they&apos;ll only catch the problems you&apos;re looking for &#x2013; but they can be a great tool for establishing a quality baseline.</p><h3 id="linting">Linting</h3><p>A lint task in the CI pipeline will be useful if there are code style standards to maintain. Not every application will require this, but it is a great addition to catch when someone is adding tabs to a codebase that uses spaces.</p><h3 id="versioning">Versioning</h3><p>Each build needs a unique version to support traceability, debugging, and rollback. If appropriate for the language, <a href="https://semver.org/?ref=mikeramos.tech" rel="noreferrer">Semantic Versioning</a> is a great way to manage versions. The CI system should generate one automatically if developers do not set a version explicitly.</p><h3 id="packaging">Packaging</h3><p>A packaging step gives developers an opportunity to move files in preparation for publishing the build artifact. This simple step ensures the artifact includes only the essentials.</p><h3 id="publishing">Publishing</h3><p>The CI pipeline culminates in a build artifact being published. These days, this is typically a container image pushed to a registry, but it could also be a JAR file published to a remote repository, or even just a collection of files ready to be moved to a server.</p><h3 id="deploying">Deploying</h3><p>There are many options for deployments, but I like to deploy feature branches to dedicated test environments, and the main branch to a persistent non-production environment followed by production. This allows developers to test their changes in an isolated environment before they merge, and protects the production environment from any changes that do not pass tests in preprod.</p><h3 id="deployment-verification">Deployment Verification</h3><p>Finally, once the deployment is complete, it is useful to run some tests to validate the change. These could include end-to-end or smoke tests, but should run quickly to determine if there are any issues with the application. This could be as simple as using curl to validate the deployment, or running Cypress for more advanced tests.</p><p>With these building blocks in place, your CI/CD pipeline becomes a reliable foundation for shipping software. Next, we&#x2019;ll look at how to level up with security scans, image signatures, and policy enforcement.</p>]]></content:encoded></item><item><title><![CDATA[Local PySpark Development for Apache Iceberg]]></title><description><![CDATA[<p><a href="https://iceberg.apache.org/?ref=mikeramos.tech" rel="noopener noreferrer nofollow">Apache Iceberg</a>&#xA0;is a new-ish format for working with large sets of data. Iceberg is very powerful as a production data lake or warehouse, but it can be tricky to configure a local development environment properly.
In this blog, I will cover how I have configured local tables using</p>]]></description><link>https://mikeramos.tech/local-pyspark-development-for-apache-iceberg/</link><guid isPermaLink="false">67f1ca9959e9e334acefa1d6</guid><category><![CDATA[Iceberg]]></category><category><![CDATA[Spark]]></category><category><![CDATA[PySpark]]></category><dc:creator><![CDATA[Michael Ramos]]></dc:creator><pubDate>Sun, 06 Apr 2025 21:43:40 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://iceberg.apache.org/?ref=mikeramos.tech" rel="noopener noreferrer nofollow">Apache Iceberg</a>&#xA0;is a new-ish table format for working with large sets of data. Iceberg is very powerful as a production data lake or warehouse, but it can be tricky to configure a local development environment properly. In this blog, I will cover how I have configured local tables using Iceberg and PySpark so that changes can be validated before deploying to production.</p><h2 id="local-environment-setup">Local Environment Setup</h2><p>In order to configure your local development environment, you&#x2019;ll first need to install a few tools.&#xA0;</p><ul><li><strong>Python</strong>: You probably already have Python, but a fairly modern version is required. If you need to install Python, I have found&#xA0;<a href="https://github.com/pyenv/pyenv?ref=mikeramos.tech" rel="noopener noreferrer nofollow">pyenv</a>&#xA0;to be very helpful. I&#x2019;m using Python version 3.11 locally.</li><li><strong>Poetry</strong>:&#xA0;<a href="https://python-poetry.org/?ref=mikeramos.tech" rel="noopener noreferrer nofollow">Poetry</a>&#xA0;is a tool that helps manage Python dependencies and automatically manages virtual environments. I&#x2019;m using Poetry version 2.1.1.</li><li><strong>Java</strong>: Spark requires Java in order to run, and I needed to update to a newer version for this application. I used&#xA0;<a href="https://formulae.brew.sh/formula/openjdk@17?ref=mikeramos.tech" rel="noopener noreferrer nofollow">Homebrew</a>&#xA0;to update to OpenJDK 17.</li></ul><p>From this point, there&#x2019;s just a bit more setup required. I like to use the <code>poetry shell</code> command to run a subshell and activate the correct virtual environment for the project, but Poetry does not ship with this plugin installed. We can add it by running <code>poetry self add poetry-plugin-shell</code>. If your project already uses Poetry, you can install the dependencies with <code>poetry install</code>. If you&#x2019;re starting a new project, you can run <code>poetry new &lt;project&gt;</code> and be sure to add PySpark as a dependency.</p><p>Next, if you updated your Java version, make sure that version is on the PATH properly. I ran this command:</p><pre><code class="language-bash">echo &apos;export PATH=&quot;/usr/local/opt/openjdk@17/bin:$PATH&quot;&apos; &gt;&gt; ~/.zshrc
</code></pre>
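<p>Putting the setup together, the full sequence looks roughly like this. This is a sketch using the versions mentioned above, and the project name is a made-up placeholder.</p><pre><code class="language-bash"># Python 3.11 via pyenv (skip if you already have a modern Python)
pyenv install 3.11

# OpenJDK 17 via Homebrew (Spark needs Java)
brew install openjdk@17

# Add the shell plugin to Poetry, then set up the project
poetry self add poetry-plugin-shell
poetry install    # for an existing project
# ...or start fresh: poetry new iceberg-demo, then poetry add pyspark
</code></pre>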
<h3 id="managing-iceberg-dependencies">Managing Iceberg Dependencies</h3><p>In order to use Iceberg with Spark, you&#x2019;ll need to add the correct JAR file when running your spark commands.&#xA0;<a href="https://iceberg.apache.org/docs/nightly/spark-getting-started/?ref=mikeramos.tech" rel="noopener noreferrer nofollow">This documentation</a>&#xA0;from Iceberg outlines which JAR should be used and how to configure Spark correctly. In addition,&#xA0;<a href="https://iceberg.apache.org/multi-engine-support/?ref=mikeramos.tech#apache-spark" rel="noopener noreferrer nofollow">this page</a>&#xA0;documents which of the available JARs are still supported.</p><p>When running locally, it is important to pass in the JAR files when running the script. You can do this with either the --packages or --jars flag. The packages flag has the ability to automatically download (and cache) the specified JAR and can be easier to use. Alternatively, you can download the Iceberg JAR yourself and pass it in with the jars flag.</p><p>Once you&#x2019;ve added the JAR files, you also need to configure the Iceberg catalog. You can do this either though the command line options or by adding config options when you initialize your Spark session.&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-configuration/?ref=mikeramos.tech#catalogs" rel="noopener noreferrer nofollow">This Iceberg documentation</a> describes the configuration options available. For my local testing, I&#x2019;ve used these options:</p><pre><code>spark.sql.catalog.local = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type = hadoop
spark.sql.catalog.local.warehouse = spark-warehouse/iceberg
</code></pre>
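<p>For example, here is roughly how those pieces combine on the command line. This sketch uses the same runtime JAR and catalog options shown above; the script name is a placeholder.</p><pre><code class="language-bash">spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=spark-warehouse/iceberg \
  my_iceberg_script.py
</code></pre>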
<h3 id="environment-variables">Environment Variables</h3><p>When running locally, it is important to make sure you also set some environment variables correctly. These are documented <a href="https://spark.apache.org/docs/latest/configuration.html?ref=mikeramos.tech#environment-variables" rel="noreferrer">here</a>. Of these, these are the ones that I&apos;ve found are most important:</p><table>
<thead>
<tr>
<th>Variable Name</th>
<th>Variable Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>JAVA_HOME</td>
<td>Set if <code>java</code> is not on your <code>$PATH</code></td>
</tr>
<tr>
<td>PYSPARK_PYTHON</td>
<td>The default should be OK here; otherwise, point to your Python executable</td>
</tr>
<tr>
<td>PYSPARK_DRIVER_PYTHON</td>
<td>The default should again be OK; otherwise, point to your Python executable</td>
</tr>
<tr>
<td>SPARK_LOCAL_IP</td>
<td>127.0.0.1</td>
</tr>
</tbody>
</table>
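<p>In practice, setting these in your shell profile looks something like the following. The JAVA_HOME path assumes the Homebrew OpenJDK 17 install from earlier; adjust it for your system.</p><pre><code class="language-bash"># Only needed if java is not already on your PATH
export JAVA_HOME=&quot;/usr/local/opt/openjdk@17&quot;

# Avoids hostname/binding issues for the local Spark driver
export SPARK_LOCAL_IP=127.0.0.1
</code></pre>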
<p>With this configuration done, you should now be able to use Iceberg with Spark locally.</p><h2 id="creating-tables">Creating Tables</h2><p>The best way to test this configuration is to try to create some tables for use locally. Start by initializing Spark using the Iceberg configurations:</p><pre><code class="language-python">import os

from pyspark.sql import SparkSession

# Initialize the Spark session with Iceberg configurations; the app
# name falls back to a default if the APP_NAME env var is unset
spark: SparkSession = SparkSession.builder \
    .appName(os.getenv(&apos;APP_NAME&apos;, &apos;iceberg-local&apos;)) \
    .config(&apos;spark.jars.packages&apos;, &apos;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2&apos;) \
    .config(&quot;spark.sql.extensions&quot;, &quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&quot;) \
    .config(&quot;spark.sql.catalog.local&quot;, &quot;org.apache.iceberg.spark.SparkCatalog&quot;) \
    .config(&quot;spark.sql.catalog.local.type&quot;, &quot;hadoop&quot;) \
    .config(&quot;spark.sql.catalog.local.warehouse&quot;, &quot;spark-warehouse/iceberg&quot;) \
    .getOrCreate()
</code></pre>
<p>You can use this object to run SQL commands. <code>spark.sql(sqlQuery)</code> will run whatever SQL command you pass in. When working with Iceberg, you can reference&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-ddl/?ref=mikeramos.tech" rel="noopener noreferrer nofollow">this page</a>&#xA0;for the valid DDL. Pay special attention to the&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-getting-started/?ref=mikeramos.tech#spark-type-to-iceberg-type" rel="noopener noreferrer nofollow">Spark to Iceberg type</a> conversion that will automatically take place as well.</p><p>Try creating a table (in the <code>local</code> catalog configured above) with:</p><pre><code class="language-python">spark.sql(&quot;CREATE TABLE local.db.sample (id bigint NOT NULL COMMENT &apos;unique id&apos;, data string) USING iceberg&quot;)
</code></pre>
<p>To write data to the table, you can also use the <code>spark.sql()</code> function. The options for writing to the Iceberg table are outlined&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-writes/?ref=mikeramos.tech#insert-into" rel="noopener noreferrer nofollow">here</a>. For a simple test, you can try to directly insert into the table with:</p><pre><code class="language-python">spark.sql(&quot;INSERT INTO local.db.sample VALUES (1, &apos;a&apos;), (2, &apos;b&apos;)&quot;)
</code></pre>
<p>You can also verify that everything is working correctly by checking your filesystem to confirm the warehouse was created with the correct tables.</p><p>Append data using a DataFrame when you&#x2019;re ready to write actual data. You can read more about that&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-writes/?ref=mikeramos.tech#appending-data" rel="noopener noreferrer nofollow">here</a>.</p><h2 id="querying-tables">Querying Tables</h2><p>To double-check that your data is being written properly, you can try to query the Iceberg table. Again, you can query with simple SQL commands using the <code>spark.sql()</code> function. You can check for the data you&#x2019;ve just written by running:</p><pre><code class="language-python">spark.sql(&quot;SELECT * FROM local.db.sample&quot;).show()
</code></pre>
<p>This will print the rows that you&#x2019;ve written to the table so far. When you&#x2019;re ready, you can also&#xA0;<a href="https://iceberg.apache.org/docs/1.8.1/spark-queries/?ref=mikeramos.tech#querying-with-dataframes" rel="noopener noreferrer nofollow">query using DataFrames</a>.</p><h2 id="summary">Summary</h2><p>Now you can create local Iceberg tables with PySpark, and read and write data to them. If you&#x2019;re using Iceberg in production, this setup will help you learn how it works and let you test your changes locally before deploying.</p>]]></content:encoded></item></channel></rss>