<h2 id="redstack-is-now-open-source">REDstack is Now Open Source!</h2>
<p>We are officially open sourcing REDstack, our sandbox tool for Big Data development at Target.</p>
<h3 id="what-is-redstack">What is REDstack?</h3>
<p>REDstack is a tool for provisioning kerberized clusters on OpenStack. We created it with four goals in mind:</p>
<ul>
<li>Provide a secured environment, with the ability to leverage preconfigured LDAP and Kerberos servers.</li>
<li>Provide out-of-the-box usability, allowing you to log in with preconfigured user accounts.</li>
<li>Provide custom user management utilities to administer the cluster.</li>
<li>Provide a fully customizable experience; everything is a configuration option in your build files:
<ul>
<li>Cluster size, node sizes, types of nodes and node roles,</li>
<li>Hadoop configurations, heap sizes, and components,</li>
<li>All users, passwords, and secure assets.</li>
</ul>
</li>
</ul>
<h3 id="components">Components</h3>
<p>REDstack is made up of two major components:</p>
<ol>
<li><a href="https://supermarket.chef.io/cookbooks/hdp-cloud">hdp-cloud</a> - The cookbook
<ul>
<li>The cookbook is used by the application itself to install components and lay down cluster configuration.</li>
<li>The cookbook can be used independently of REDstack to manually provision a cluster.</li>
</ul>
</li>
<li><a href="https://github.com/target/redstack">REDstack</a> - The orchestration component
<ul>
<li>REDstack is a Python application that handles all of the high-level orchestration and timing associated with a full Hadoop installation:
<ul>
<li>Orchestrates the provisioning of resources over OpenStack APIs,</li>
<li>Controls and monitors parallel Chef deployment across the cluster,</li>
<li>Manages and monitors cluster component install over HTTPS requests (see the sketch after this list).</li>
</ul>
</li>
</ul>
<p>REDstack is bundled with a Docker image in which all of the dependencies are already installed and configured; the build configs are set up locally inside the container before an installation.</p>
</li>
</ol>
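<p>Purely as an illustration of that last point, here is a minimal sketch of what polling a cluster manager’s install status over HTTPS can look like. This is not REDstack’s actual code; the endpoint path, response fields, and timings are placeholder assumptions, and the real application also handles authentication, retries, and parallelism.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
import requests

def wait_for_install(base_url, request_id, auth, timeout_s=3600):
    """Poll a hypothetical install-status endpoint until it completes or fails."""
    status_url = "{0}/api/v1/requests/{1}".format(base_url, request_id)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(status_url, auth=auth)
        resp.raise_for_status()
        status = resp.json().get("request_status", "UNKNOWN")
        if status == "COMPLETED":
            return True
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError("Install ended in state " + status)
        time.sleep(30)  # component installs are slow; poll gently
    raise RuntimeError("Timed out waiting for the install to finish")
</code></pre></div></div>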
<h3 id="how-to-get-started">How to Get Started</h3>
<p>Head over to the repository at <a href="https://github.com/target/redstack">https://github.com/target/redstack</a> and follow along. The repo has instructions on how to build and configure the clusters using the included Docker image.</p>
<h2 id="history-of-the-project">History of the Project</h2>
<p>Target’s Big Data Platform Team manages multiple Big Data environments, with hundreds of nodes and many petabytes of data. As mentioned in our prior blog posts, we depend heavily on Chef as a core part of our CI/CD pipeline. During our testing a couple of years ago, we identified a large opportunity: a way to do full integration testing of our Chef cookbooks. This opportunity opened the door for a new product.</p>
<h3 id="origins">Origins</h3>
<p>Early on, we released a product internally called PushButton. This was a three-node cluster that ran on a standard-issue laptop at Target. It was kerberized, and it worked well as a sandbox environment for developers because it had the same security setup. PushButton, however, did not use the same cookbooks as the main cluster. We wanted something that could do what PushButton did, but with our real cookbooks in a larger environment. By creating a full cluster from scratch, we would be able to understand exactly how the cookbooks would function when applied to new nodes and make sure the cookbooks were in a constant working state. We also needed a somewhat larger sandbox; otherwise we would have had difficulty testing high availability (HA) components, or those that run across multiple nodes, like Apache ZooKeeper.</p>
<h3 id="toward-redstack">Toward REDstack</h3>
<p>Our early exploration started by trying to adapt the existing PushButton work to OpenStack. We used shell scripts to automate creation of instances, Knife to bootstrap the nodes with the Chef recipes, and cURL requests to automate and monitor the install process. We got it working, but we still had to face our biggest challenge yet: integrating cookbooks meant for physical hardware onto virtualized nodes. The cookbooks expected particular configurations, such as physical drive formatting and partitioning, or master services that were already defined and running. Instead, we had to get them working with our minimum 1x50GB virtual volumes, and anything we changed still had to work on the physical nodes. After some difficult work, and with some clever tricks with attributes and Chef injection, we were able to preconfigure the nodes to be recognized by the recipes and slowly committed our changes back to the ecosystem without impacting the main cluster’s health.</p>
<h3 id="a-product-is-born">A Product is Born</h3>
<p>By this point, we had written the entire process as a Python application and set it up on a nightly loop. Every night, it would build an entire Hadoop cluster from scratch, smoke test it, and report the results to the team. And it worked! Word started to spread across the organization, and we were suddenly getting requests from users of our production cluster who wanted REDstack to spin up Hadoop sandboxes for them. Not only would a sandbox be more powerful and shareable by multiple people on a team, it would look exactly like our production cluster because it used all of the same configurations and assets.</p>
<h3 id="opportunity">Opportunity!</h3>
<p>As a data engineer, wouldn’t it be nice to have a kerberized sandbox environment that looked like a production cluster, was easy to work with, and was user friendly? This is what these environments made possible, so we started trying to provide them to teams. It didn’t work very well initially: users cloned the repo, ran it on their laptops, and hit all sorts of issues with Chef, Ruby, gem, and Python versions. There were simply too many dependencies to manage across different environments, even with existing dependency management tools. We needed a way to hide all of the complexity and eliminate the need for users to run anything on their computers.</p>
<h3 id="the-full-stack-service">The Full-Stack Service</h3>
<p>This led us to the development of our full-stack cluster delivery service, Stacker. Stacker is an API running on top of a database, orchestrating REDstack deploys in threads and listening for requests from a front-end web page. Users simply submit a request on the website, and a cluster is delivered to them in about an hour. To date, there have been more than 500 unique cluster requests, and at least 30 teams actively use REDstack as part of their development process.</p>
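<p>Stacker itself is internal and not part of this release, but the core pattern is easy to picture: requests come in from the front end, get recorded, and a pool of worker threads runs one REDstack deploy per request. The sketch below illustrates only that threading pattern; the queue payload and the <code class="highlighter-rouge">deploy_cluster</code> stand-in are placeholders, not Stacker’s real code.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import queue
import threading

cluster_requests = queue.Queue()

def deploy_cluster(request):
    # Stand-in for kicking off a full REDstack build for one request.
    print("Deploying a %s cluster for %s" % (request["flavor"], request["team"]))

def worker():
    while True:
        request = cluster_requests.get()
        try:
            deploy_cluster(request)
        finally:
            cluster_requests.task_done()

# A small pool of threads lets several cluster deploys run in parallel.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

cluster_requests.put({"team": "example-team", "flavor": "hadoop"})
cluster_requests.join()
</code></pre></div></div>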
<h3 id="ongoing-development">Ongoing Development</h3>
<p>Over time, REDstack has evolved to include multiple types of Big Data clusters. We now provide Elasticsearch and Druid clusters in addition to the original Hadoop clusters. Our service continues to evolve with new releases and versions of the software, along with additional ease-of-use work on our built-in functions for user management and cluster administration.</p>
<h4 id="about-the-author">About the Author</h4>
<p>Eric Krenz is a Senior Data Engineer on the Big Data Platform Team at Target.</p>
<p><a href="https://target.github.io/big%20data%20infrastructure/REDstack-Hadoop-as-a-Service">REDstack</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on December 07, 2017.</p>
<h3 id="k8guard-is-officially-open-source">K8Guard is Officially Open Source</h3>
<p>I am happy to announce that Target has open sourced K8Guard. I have been part of designing and developing it for the past few months, and I’m going to share a little more about it.</p>
<h3 id="what-is-k8guard">What is K8Guard?</h3>
<p>K8Guard is an auditing system for Kubernetes clusters.
It monitors different entities on your cluster for possible violations. K8Guard notifies the violators and then takes action on them. It also provides metrics and dashboards about violations in the cluster through Prometheus.</p>
<h3 id="how-to-pronounce-it">How to Pronounce It?</h3>
<p>Like Kate Guard - the guardian angel for your Kubernetes clusters.</p>
<h3 id="why">Why?</h3>
<p>If you have large Kubernetes clusters and you care about security, efficiency, availability, and stability, you need a tool to detect violations and take appropriate action on them.</p>
<h3 id="what-kind-of-violations-does-it-discover">What Kind of Violations Does It Discover?</h3>
<table>
<thead>
<tr>
<th><strong>Violation</strong></th>
<th><strong>Why</strong></th>
<th><strong>Example</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Image Size</td>
<td>Efficiency</td>
<td>5 GB image size</td>
</tr>
<tr>
<td>Image Repo</td>
<td>Security</td>
<td>Downloading image from a shady repo</td>
</tr>
<tr>
<td>Extra Capabilities</td>
<td>Security</td>
<td>Setting UID/GID</td>
</tr>
<tr>
<td>Privileged Mode</td>
<td>Security</td>
<td>Root containers</td>
</tr>
<tr>
<td>Single Replica</td>
<td>Availability</td>
<td>Not 12-factor app</td>
</tr>
<tr>
<td>Invalid Ingress</td>
<td>Security/Stability</td>
<td>Having “*” in ingress</td>
</tr>
<tr>
<td>Mount Host Vols</td>
<td>Security/Stability</td>
<td>Mounting Kubernetes system files</td>
</tr>
<tr>
<td>No Owner</td>
<td>Security</td>
<td>No owner annotation for namespace</td>
</tr>
</tbody>
</table>
<h3 id="what-kind-of-entities-does-it-monitor">What Kind of Entities Does It Monitor?</h3>
<p>Any entity deployed to your Kubernetes cluster, such as Deployments, Pods, Jobs/CronJobs, Ingresses, and Namespaces.</p>
<h3 id="what-kind-of-actions-does-it-take">What Kind of Actions Does It Take?</h3>
<ul>
<li>Notifies the namespace owner (email, hipchat, …).</li>
<li>After X notifications, it takes a hard action, such as:
<ul>
<li>Scale bad deployments down to zero.</li>
<li>Suspend bad jobs.</li>
<li>Delete bad ingresses.</li>
</ul>
</li>
</ul>
<p>Note that there is a safe mode, which only notifies and does not take hard actions.</p>
<hr />
<h3 id="the-k8guard-design">The K8Guard Design</h3>
<p>K8Guard has three main microservices: discover, action, and report.</p>
<p><img src="../images/k8guard-design-drawing.jpg" alt="K8Guard-design-drawing" title="K8Guard Design Drawing" /></p>
<ul>
<li>The discover service, when in messaging mode, finds violations and puts them on a Kafka topic. In API mode, it can serve results without depending on Kafka: you hit its endpoints and get a JSON response.</li>
</ul>
<hr />
<p><img src="../images/k8guard-discover-api.png" alt="K8Guard-discover-api" title="K8Guard Discover Api" /></p>
<hr />
<ul>
<li>The action service reads the violations off Kafka, takes action on them, and records the actions in a database (Cassandra).</li>
</ul>
<p>The same message is also sent to HipChat, tagging the violators.</p>
<hr />
<ul>
<li>The report service generates a human-readable, searchable report of all past violation actions.</li>
</ul>
<hr />
<p><img src="../images/k8guard-report.png" alt="k8guard-report-screenshot" title="K8Guard Report Screenshot" /></p>
<h3 id="integration">Integration</h3>
<p>The K8Guard discover service has an API mode that you can use to integrate with other apps.</p>
<p>Example API Response:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{
    "Name": "dummy-deployment-name1",
    "Namespace": "dummy-namespace-1",
    "Cluster": "dummy-cluster-name1",
    "Violations": [{
      "Source": "dummy-repo/dummy-image:latest",
      "Type": "IMAGE_REPO"
    }]
  },
  {
    "Name": "ye-another-dummy-deployment",
    "Namespace": "dummy-namespace-1",
    "Cluster": "dummy-cluster-name1",
    "Violations": [{
      "Source": "another-dummy-repo/dummy-image2:latest",
      "Type": "IMAGE_SIZE"
    }]
  }
]
</code></pre></div></div>
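<p>As a quick illustration of consuming that response, the sketch below fetches violations from a discover endpoint and counts them by type. The URL is a placeholder; the actual path depends on how you deploy K8Guard.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import requests

# Placeholder endpoint; substitute your own K8Guard discover API address.
DISCOVER_URL = "http://k8guard-discover.example.com/api/deployments"

def summarize_violations(url=DISCOVER_URL):
    """Count violations by type across all returned entities."""
    entities = requests.get(url, timeout=30).json()
    counts = {}
    for entity in entities:
        for violation in entity.get("Violations", []):
            vtype = violation.get("Type", "UNKNOWN")
            counts[vtype] = counts.get(vtype, 0) + 1
    return counts

if __name__ == "__main__":
    for vtype, count in sorted(summarize_violations().items()):
        print("%s: %d" % (vtype, count))
</code></pre></div></div>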
<hr />
<h3 id="metrics">Metrics</h3>
<p>The discover API generates 19 metrics, accessible at <code class="highlighter-rouge">/metrics</code>; you can hook a monitoring system like Prometheus up to collect them and then build pretty Grafana dashboards.</p>
<p>Violation metric examples:</p>
<ul>
<li>The number of all deployments.</li>
<li>The number of bad deployments.</li>
<li>The number of all pods.</li>
<li>The number of bad pods.</li>
</ul>
<p>It also collects performance metrics for free. By free I mean that while K8Guard makes its Kubernetes API calls, it also measures how long each one takes and exposes those timings as metrics (see the sketch after the examples below).</p>
<p>Performance metrics examples:</p>
<ul>
<li>The number of seconds it took to return all images from the Kubernetes API.</li>
<li>The number of seconds it took to return all deployments from the Kubernetes API.</li>
</ul>
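<p>K8Guard itself is written in Go, but the “metrics for free” idea is easy to sketch in any language. The example below (Python, using the <code class="highlighter-rouge">prometheus_client</code> library) times a stand-in API call and exposes the duration at <code class="highlighter-rouge">/metrics</code>; the metric name and the fake call are illustrative only, not K8Guard’s actual code.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
from prometheus_client import Gauge, start_http_server

# Illustrative metric: seconds taken by the last "list deployments" call.
deployment_list_seconds = Gauge(
    "example_deployment_list_seconds",
    "Seconds taken to return all deployments from the Kubernetes API",
)

def list_deployments():
    time.sleep(0.1)  # stand-in for the real Kubernetes API call
    return []

def timed_list_deployments():
    start = time.time()
    deployments = list_deployments()
    deployment_list_seconds.set(time.time() - start)
    return deployments

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        timed_list_deployments()
        time.sleep(30)
</code></pre></div></div>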
<p><img src="../images/k8guard-grafana-dashboard.png" alt="K8Guard-grafana-dashboard" title="K8Guard Example Grafana Dashboard" /></p>
<h3 id="configurable">Configurable</h3>
<p>There are dozens of configuration options to customize which violations to discover and what to do with them. K8Guard also ships with a safe mode that only notifies and won’t take any hard action.</p>
<h3 id="what-tools-and-technologies-are-used">What Tools and Technologies are Used?</h3>
<ul>
<li>Golang</li>
<li>Kafka</li>
<li>Cassandra</li>
<li>Prometheus</li>
<li>Memcached</li>
</ul>
<h3 id="give-it-a-try-it-is-developer-friendly">Give It a Try, It Is Developer Friendly!</h3>
<p>Batteries <em>are</em> included! You can easily run K8Guard locally on your computer and play with it. You can run the whole K8Guard system in either Minikube or docker-compose (your choice).</p>
<p>We have made easy Make commands for K8Guard. Don’t be shy if you see any issues or if you would like to <a href="https://k8guard.github.io/community/">contribute</a> to the code; please do.</p>
<p>The best place to start is the <a href="https://k8guard.github.io">K8Guard GitHub page</a>.</p>
<hr />
<h4 id="about-the-author">About the Author</h4>
<p><a href="https://twitter.com/MedyaDaily">Medya Ghazizadeh</a> is a Senior Engineer and part of Target’s Cloud Platform Engineering team. He is the author of multiple open source projects such as <a href="https://github.com/target/winnaker">Winnaker</a> and <a href="https://github.com/k8guard/k8guard-start-from-here">K8Guard</a>.</p>
<p><a href="https://target.github.io/infrastructure/k8guard-the-guardian-angel-for-kuberentes">K8Guard</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on June 27, 2017.</p>
<p>Here at Target, we run our own private OpenStack cloud and have never been able to accurately measure the performance of
our hardware. This lack of measurement prevents the evaluation of performance improvements of new hardware or alternative
technologies running as drivers inside OpenStack. It also prevents us from providing a Service Level Agreement (SLA) to our
customers. Recently we have been striving to improve our OpenStack service which led us to talk to our consumers directly.</p>
<p>One of the major pieces of feedback from our consumers was that the performance of the OpenStack cloud was lower
than expected. Because we had not measured the performance of our cloud in the past, we had no way of knowing whether
new hardware or configuration changes improved consumer-facing performance. With our new OpenStack environment builds we
focused on changing this. But first we needed a tool to do the job.</p>
<h2 id="searching-for-a-tool">Searching for a Tool</h2>
<p>The first tool we looked at was <a href="https://wiki.openstack.org/wiki/Rally">Rally</a>. Rally does performance testing of an
OpenStack cloud. However, Rally focuses on the OpenStack API only. It is mainly used to test functionality
(via <a href="https://github.com/openstack/tempest">Tempest</a>) and stability of the API under large amounts of load. Rally does
contain a resource to boot an instance and run Linux CLI commands via user data. We tested this as a way to stage
instances for running performance software. However, starting the software on each instance at the same time and
collecting the results from that software was difficult and not viable. Because of this, we deemed Rally unsuitable
for our needs.</p>
<p>The next tool we looked at was <a href="https://github.com/openstack/kloudbuster">KloudBuster</a>. KloudBuster is a tool that does
performance testing inside an OpenStack instance. At the time of writing it provides two sets of tests: HTTP and storage.
The HTTP test uses a traffic generator to measure requests per second and latency between instances. The storage test uses
<a href="https://github.com/axboe/fio">FIO</a> to measure read/write IOPs and bandwidth. KloudBuster does what we were looking for,
measuring the performance of instances inside of OpenStack. However, it does not support adding tests beyond the two
included, it has limited configuration options for the environment setup, and it had stability issues inside our OpenStack
cloud. Because of this, we deemed KloudBuster unsuitable as well.</p>
<h2 id="creating-our-own">Creating Our Own</h2>
<p>With current options not meeting all of our needs, we decided the best option was to create our own performance framework
that can be flexible to cover a wide variety of tests and environment setups. These were our requirements:</p>
<ul>
<li>support the following tests: network, storage, and CPU</li>
<li>easily support adding future tests</li>
<li>flexible enough to support any OpenStack environment (including old versions)</li>
<li>ability to use fixed or floating IP addresses for connectivity</li>
<li>ability to use ephemeral or cinder storage</li>
<li>return test results in a format usable by other software</li>
</ul>
<h2 id="enter-cloudpunch">Enter CloudPunch</h2>
<p><a href="https://github.com/target/cloudpunch">CloudPunch</a> is a tool developed by the OpenStack team here at Target. It is
completely open source and follows the MIT license. CloudPunch has the following features:</p>
<ul>
<li>
<p><strong>Written 100% in Python</strong> - CloudPunch is written in the Python language, including the sections that stage OpenStack and
the tests that run. This was chosen to avoid reliance on other tools.</p>
</li>
<li>
<p><strong>Create custom tests</strong> - Because tests are written in Python, custom-written tests can be run by simply dropping a file
in a folder. These tests are not limited; a test can do anything Python can do.</p>
</li>
<li>
<p><strong>Fully scalable</strong> - A test can include one instance or hundreds. A couple lines of configuration can drastically change
the stress put on OpenStack hardware.</p>
</li>
<li>
<p><strong>Test across OpenStack environments</strong> - Have multiple OpenStack environments or regions? Run tests across them to see
performance metrics when they interact.</p>
</li>
<li>
<p><strong>Run tests in an order or all at once</strong> - See single metric results such as network throughput or see how a high network
throughput can affect network latency.</p>
</li>
<li>
<p><strong>JSON and YAML support</strong> - Use a mix of JSON and YAML for both configuration and results.</p>
</li>
</ul>
<h2 id="how-does-cloudpunch-work">How Does CloudPunch Work?</h2>
<p>CloudPunch separates the process of running a test into three major roles:</p>
<ul>
<li>
<p><strong>Local Machine</strong> - The machine starting the test(s) and receiving the results outside of OpenStack.</p>
</li>
<li>
<p><strong>Master</strong> - The OpenStack instance that acts as the bridge between the outside of OpenStack and the instances inside it. The local machine
sends configuration to the master so the slaves can get it, and the slaves send test results back to the master so the local
machine can receive them.</p>
</li>
<li>
<p><strong>Slave</strong> - The OpenStack instance that runs the test(s). It reports only to the master.</p>
</li>
</ul>
<p>For more specific information on these roles, see
<a href="https://github.com/target/cloudpunch/blob/master/docs/cloudpunch.md#process-breakdown">here</a>.</p>
<p>To better explain the process from start to finish, I will go over an example of running a simple ping test between
instances. To initially start CloudPunch, I give the CLI a configuration file, an environment file, and an OpenRC file.</p>
<p>CloudPunch on the local machine then begins to stage the OpenStack cloud (in this order) with a security group, a keypair,
the master router, the master network, the master instance, the slave routers, the slave networks, then finally the slave
instances. The local machine now waits for all of the instances to be ready by checking in with the master server via a
<a href="http://flask.pocoo.org/">Flask</a> API. At the same time, all slave instances are working to register with the master instance
to provide host information and say that they are ready.</p>
<p>Once the master server and all slaves are registered and ready, the local machine sends the configuration to the master
server and signals that the test can now begin. The slave instances are checking in with the master server every ~1 second
for the test status to be ready. Once it is ready, the slaves pull down the configuration from the master server. The slaves
then start the ping test by calling the <code class="highlighter-rouge">ping.py</code> file inside the slave directory. This file runs the ping command via the
shell and captures the latency results. The slaves collect these results and once the test is complete, send these results
back to the master instance. While all of this is happening, the local machine is checking in with the master server every ~5
seconds to see whether all slaves have posted their results to the master.</p>
<p>Once all slaves have posted results, the local machine pulls down the results and saves them depending on the
configuration. Now that the process of running the test is complete, the local machine deletes all of the resources
it created on the OpenStack cloud. I am now left with the results of the ping test without having any resources sitting on
the OpenStack cloud.</p>
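<p>To make the register/poll/report loop concrete, here is a minimal sketch of what a slave might do. The endpoint names and payloads are placeholder assumptions for illustration, not CloudPunch’s actual API.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time
import requests

MASTER = "http://192.0.2.10"  # placeholder master address

def run_test(config):
    # Stand-in for the real test module (ping.py in the walkthrough above).
    return {"latency_ms": []}

def run_slave(master=MASTER):
    # Register with the master and provide host information.
    requests.post(master + "/register", json={"hostname": "slave-1"}, timeout=10)

    # Check in roughly every second until the master says the test is ready.
    while True:
        status = requests.get(master + "/status", timeout=10).json()
        if status.get("state") == "ready":
            config = requests.get(master + "/config", timeout=10).json()
            break
        time.sleep(1)

    # Run the test, then post the results back for the local machine to collect.
    results = run_test(config)
    requests.post(master + "/results", json=results, timeout=10)
</code></pre></div></div>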
<h2 id="how-do-i-get-started-with-cloudpunch">How Do I Get Started With CloudPunch?</h2>
<p>See the getting started guide
<a href="https://github.com/target/cloudpunch/blob/master/docs/getting_started.md#getting-started">here</a>.</p>
<h4 id="about-the-author">About the Author</h4>
<p>Jacob Lube is an Engineer and part of Target’s Enterprise Cloud Engineering team, focusing on anything OpenStack. Outside
of OpenStack, Jacob enjoys riding his bicycle, playing video games, and messing around in the Python language.</p>
<p><a href="https://target.github.io/cloudpunch">Measuring the Performance of our OpenStack Cloud</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on June 20, 2017.</p>
<p>One of the strongest benefits of launching an application into the cloud is the pure on-demand scalability that it provides.</p>
<p>I’ve had the privilege of working with the ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation for the past two years. When we started, we were pleased with our search and query performance with tens of gigabytes of data in the production cluster. When peak time hit, we reveled as our production clusters successfully managed half a terabyte of data(!).</p>
<p>During peak, Target hosted 14 Elasticsearch clusters in the cloud containing more than 83 billion documents across nearly 100 terabytes in production environments alone. Consumers of these logs get query results blazingly fast, with excellent reliability.</p>
<p>It wasn’t always that way though, and our team learned much about Elasticsearch in the process.</p>
<h2 id="whats-the-use-case-at-target">What’s The Use Case At Target?</h2>
<p>In a word, “vast.” The many teams that use our platform for log aggregation and search are often looking for a variety of things.</p>
<ul>
<li>Simple Search</li>
</ul>
<p>This one is easy, and the least resource intensive. Simply doing a match query and searching for fields within our data.</p>
<ul>
<li>Metrics / Analytics</li>
</ul>
<p>This one can be harder to accommodate at times, but some teams use our Elasticsearch clusters for near-realtime monitoring and analytics using Kibana dashboards.</p>
<ul>
<li>Multi-tenant Logging</li>
</ul>
<p>Not necessarily consumer facing, but an interesting use for Elasticsearch is that we can aggregate many teams and applications into one cluster. In essence, this saves money compared with each application paying for its own logging infrastructure.</p>
<p>Simple search is the least of our concerns here. Queries add marginal load on the cluster, but often they are one-offs or otherwise infrequently used. However, the largest challenge faced here is multi-tenant demand. Different teams have very different needs for logging/metrics; designing a robust and reliable ‘one-size-fits-all’ platform is a top priority.</p>
<h2 id="a-few-of-our-favorite-things">A Few of Our Favorite Things</h2>
<p>At Target, we make great use of open-source tools. We also try to contribute back where we can! Tools used for our logging stack (as far as this post is concerned) include:</p>
<ul>
<li>Elasticsearch, Logstash, Kibana (ELK stack!)</li>
<li>Apache Kafka</li>
<li>Hashicorp Consul</li>
<li>Chef</li>
<li>Spinnaker</li>
</ul>
<h2 id="in-the-beginning">In The Beginning…</h2>
<p>When the ELK stack originally took form at Target, it was with Elasticsearch version 1.4.4. While we have taken steps to manage and update our clusters, our underlying pipeline has remained the same.</p>
<p><img src="../images/elk-flow.png?raw=true" alt="elk-flow" title="Elk Flow" /></p>
<p>To get logs and messages from applications, we have a Logstash client installed to forward the logs through our pipeline to Apache Kafka.</p>
<p>Kafka is a worthwhile addition to this data flow, and is becoming increasingly popular for its resiliency. Messages are queued and consumed as fast as we can index them via a Logstash consumer group and finally into the Elasticsearch cluster.</p>
<p>While we were impressed and happy with our deployment, the initial Elasticsearch clusters were simple by design. Every node was responsible for all parts of cluster operations, and we had clusters ranging from 3 to 10 nodes as capacity demands increased.</p>
<p><img src="../images/elk-pre-flow.png?raw=true" alt="elk-pre-flow" title="Elk Pre-Flow" /></p>
<h2 id="the-next-step">The Next Step</h2>
<p>The initial configuration did not last very long; traffic and search demands rose with increased onboarding of cloud applications, and we simply could not guarantee reliability with the original configuration.</p>
<p>Before our peak season in 2015, we rolled out a plan to deploy the universally accepted strategy of using dedicated master, data, and client nodes.</p>
<p><img src="../images/elk-new-flow.png?raw=true" alt="elk-new-flow" title="Elk New Flow" /></p>
<p>Separating out the jobs (master, client, data) is a very worthwhile consideration when planning for a scalable cluster. The main reason is that, when running Elasticsearch, heap size and memory are your main enemy. Removing extra functions from the process and freeing up more heap for a dedicated job was the single most effective thing that could be done for cluster stability.</p>
<h2 id="deployment-strategy">Deployment strategy</h2>
<p>With the initial cluster, we used a Chef pipeline tied into the cloud by an in-house developed deployment platform. For the 2016 season, we shifted to a deployment pattern that moved away from cookbooks and Chef completely, replacing them with an RPM-style deployment via Spinnaker.</p>
<p>RPMs were a huge turning point for the team. The average time to build an Elasticsearch data node, especially one with HDDs, was above 15 minutes with Chef. Emergency adjustments and failures were hard to respond to with any speed. Deploying an RPM via Spinnaker, we were able to deploy the same image to the same instance types in <em>less than 5 minutes</em>; instances with SSDs would be active in even less time than that.</p>
<p>Spinnaker made deployments dead simple to provision: entire environments with all their components could be set to trigger at the press of a single button, a welcome change from the unfortunately delicate and time-consuming process that it replaced.</p>
<p>Yet another major change for us was segmenting out high-volume tenants. Spinnaker made this much, much easier than it would have been with the system we had with Chef, but the real star of this show is Hashicorp’s Consul. Our team was able to stand up logging clusters for tenants and shift the log flow to the correct clusters seamlessly, with no impact to our clients, using consul-template and Consul key/value stores.</p>
<h2 id="our-final-form">Our Final Form?</h2>
<p>Elasticsearch at Target has gone through some changes since our Peak season in 2015. We now utilize Elasticsearch 2.4.x in production, enhanced Logstash configurations, and highly scalable, dynamic clusters. Much of this has been made possible with Hashicorp’s Consul and use of their easily managed Key/Value store.</p>
<p>We also broke out and experimented with data node types. Using a “hot/cold” system, we have been able to save on infrastructure costs by moving older, less-searched data to HDDs and keeping fresh, frequently searched data on SSD storage.</p>
<p>As stated before, during peak holiday load, we dealt with over 83 billion docs in our production environments, and a single production cluster often hosts over 7500 shards at a time. Our indices are timestamped by day, such that we can easily purge old data and make quick changes to new data.</p>
<p>Deployment strategy changes were key. Our logging platform endured the 2016 holiday peak season with <em>100% uptime</em>. On-call schedules were much easier for us this year!</p>
<h2 id="some-lessons-learned">Some Lessons Learned</h2>
<ul>
<li>Again, RAM/Memory is the main enemy. A cluster will frequently run into Out of Memory errors, GC locks, etc. without a good supply of heap.</li>
<li>Dedicated Masters. Just do it. It’s passable to combine Kibana/Client nodes, or Data/Client nodes in small environments, but avoid this pitfall and just dedicate masters for ELK.</li>
<li>31GB is a good heap size for your data nodes. At a 32GB heap and beyond, the JVM can no longer use compressed object pointers (compressed oops), so the larger heap is not efficient for this purpose.</li>
<li>Using Elasticsearch plugins, such as Kopf, will make your life <em>visibly</em> easier. (Mind, this isn’t supported out of the box if you go with Elasticsearch 5)</li>
<li>Get a cheat sheet together of your most frequently used API commands. When setting up a cluster, chances are good you will be making frequent tweaks for performance.</li>
<li>Timestamp your indices if your traffic is substantial; reindexing to allow for more shards is painful and slow (see the sketch after this list).</li>
<li>Consumers of your logs (customers) <em>aren’t</em> always right; be mindful of Logstash configurations on forwarders to avoid indexing errors in Elasticsearch.</li>
</ul>
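<p>As an illustration of the timestamped-indices point above, the sketch below deletes daily indices older than a retention window over the Elasticsearch REST API. It assumes indices named like <code class="highlighter-rouge">logstash-YYYY.MM.DD</code> and a local endpoint; adjust both for your cluster, or use a purpose-built tool such as Elasticsearch Curator for anything beyond a quick script.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime, timedelta
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster endpoint
INDEX_PREFIX = "logstash-"         # assumed daily naming, e.g. logstash-2016.11.25
RETENTION_DAYS = 30

def purge_old_indices(es_url=ES_URL, prefix=INDEX_PREFIX, retention_days=RETENTION_DAYS):
    """Delete daily indices older than the retention window."""
    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    resp = requests.get(es_url + "/_cat/indices?h=index", timeout=30)
    resp.raise_for_status()
    for index in resp.text.split():
        if not index.startswith(prefix):
            continue
        try:
            day = datetime.strptime(index[len(prefix):], "%Y.%m.%d")
        except ValueError:
            continue  # not a daily index; skip it
        if day < cutoff:
            requests.delete(es_url + "/" + index, timeout=30).raise_for_status()
            print("Deleted " + index)
</code></pre></div></div>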
<h2 id="in-closing">In Closing</h2>
<p>Final form? Of course not! There’s always more to do, and our team has some exciting things planned for the future of logging and metrics at Target.</p>
<h3 id="about-the-author">About The Author</h3>
<p>Lydell is a Sr. Engineer and all-around good guy, most recently working with the API Gateway team at Target. When he’s not tinkering with the gateway, you can probably find him playing with his band (Dude Corea), playing the latest video game, or trying to skill up his Golang abilities.</p>
<p><a href="https://target.github.io/logging%20and%20metrics/elasticsearch-cloud">Target and Elasticsearch: Maintaining an ELK stack over Peak Season</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on May 25, 2017.</p>
<p>It is 7:30 AM on a Monday morning in late October. I am waiting in line at Cafe Donuts to bring my team breakfast for our mandated ‘no work for one hour’.</p>
<p>We just wrapped up a strenuous week of implementing a major upgrade to our Elastic logging cluster. Many digital teams are relying on this upgrade to position themselves to confidently monitor their application health during the most important day in retail: Black Friday. It was a successful, much-anticipated upgrade that resulted in many hours of overtime, late-night calls, and cross-team performance tests. Morale is high, but it hasn’t always been.</p>
<p>It has been over three months since the team disassembled to work on new initiatives. I have since had a chance to reflect on the qualities of the team that made it so unique. I can’t speak for my colleagues, but the accomplishments we made over those six months are my proudest achievements in my five-year tenure at Target.</p>
<h3 id="leadership-support">Leadership support</h3>
<p>Our leaders truly supported and enabled our work to be successful. They gave us the resources and confidence to confront organizational blockers and competing priorities. They helped us celebrate wins by providing recognition through updates to high level executives and advertising our achievements in forums. They took risks for us, trusted us to make decisions, and went to bat for us even when there was doubt that we could deliver on our promises.</p>
<h3 id="respect">Respect</h3>
<p>This quality is near and dear to my heart. I never witnessed competition, resentment, distrust, or judgment among the team. Everyone truly respected one another and their unique perspectives. We were all aligned, working towards the same goals, and were in it together. Lead engineers partnered with junior engineers to ensure they understood the entire tech stack and could independently support our customers. I personally felt like my opinions and ideas mattered. On-call schedules were set, but we supported one another through escalations and incidents. As the primary on-call member, you knew you were never alone. There were countless occurrences where engineers picked up shifts when they knew someone had to focus on life events or just needed a break. It was obvious that team members were proud of one another. I found myself bragging constantly in leadership updates and even to my family and friends about the uniqueness of our team.</p>
<h3 id="visibility">Visibility</h3>
<p>Four different product owners rotated in and out of the team in a nine-month span. At times, it felt like a constant, disruptive setback. But in hindsight, it enabled each team member to feel more connected to our priorities and customers. We were forced to share responsibility for visits and check-ins with application teams. It is much easier to motivate a team when they can deeply empathize with their customer. The entire team built and prioritized our backlog instead of a Product Owner writing and maintaining tasks to hand off to the team. When our final Product Owner joined the team, he humbly respected the team dynamics and wholeheartedly advocated for our work.</p>
<h3 id="people-over-process">People over process</h3>
<p>My favorite agile ‘principle’. Our team quickly iterated through several methodologies, tools, and agile techniques. We kept the pieces that worked for us and ditched the ones that hindered our ability to move fast. We never considered processes or ceremonies sacred, but instead used them as tools to improve our productivity. We learned our colleagues’ working styles and how to best communicate with one another. We typically picked up 1-2 ‘features’ at a time and swarmed until they were finished. Engineers partnered on tasks that interested them, but also the non-sexy ones that they knew just had to get done. We met daily (most often in person) to share updates, blockers, and learnings. We talked often about frequent support incidents and aligned our priorities to fix them. We also talked about other things - Matt’s cat, Lydell’s band, Shawn’s daughter, Aaron’s hockey games, Jonathon’s inability to wink, Mickie’s hair color of the week, or the number of CD-ROM discs that we could fill with an audio version of someone reading all of the application logs our platform could ingest on a daily basis. The mutual respect and trust allowed us to connect on a personal level. This is an aspect of a successful team that I believe to be vitally important. Fostering the human side of software development - or any job that requires teamwork - is vital in maintaining healthy, productive, and happy teams.</p>
<h3 id="be-your-authentic-self">Be your authentic self</h3>
<p>This one speaks mostly to my role as a Scrum Master. I consider myself somewhat of a perfectionist in my professional career. I strive to meet the expectations of my role in the organization, attempting to exceed the job description. I find the Scrum Master role to be very tough to measure. What do my day-to-day tasks look like? How can I measure my success? Over the past couple years, I have learned that it isn’t possible to concretely answer these questions. There isn’t a list of steps in a Project Plan to follow to produce the #dreamteam. You also cannot fulfill the role by simply setting up and facilitating meetings. Instead, it requires working through a vast amount of interactions every day to enable a team to perform at its full potential, consistently delivering quality software. It is about each member feeling proud of what they achieved and ensuring that the team continues to improve its way of working. I realize that I feel most successful when I remove my expectation of trying to mirror a perfect ‘mold’ of the role. Instead, I try to be myself, leveraging my experiences and strengths to make the best possible decisions to benefit my team. Fortunately, I naturally yearn to help people and make them happy, so the role fits my personality quite well.</p>
<h3 id="passion-for-technology">Passion for technology</h3>
<p>I am very fortunate to work among patient, caring, and inclusive individuals. As a naturally curious person, I strive to understand as much as I can about the technologies that my teams use and build. Without hesitation, everyone was extremely supportive of my desire to learn and contribute to our product. As we worked through technical design and strategic decisions, they explained things in detail, encouraging me to ask questions and share my thoughts. Eddie Roger, one of our Principal Engineers and a personal mentor, was instrumental in establishing my confidence to develop a true passion for technology and even roll up my sleeves to build and deploy changes to our platform. Armed with this support and experience, I have been able to expand my role as a Scrum Master: recognizing the team’s pain points, posing insightful questions, and effectively communicating with our stakeholders and partners. My colleagues’ gracious willingness to teach me has allowed me to feel truly connected to the vision and future of our products.</p>
<p><a href="https://target.github.io/agile/surviving-retail-peak-season-on-the-digital-observability-platform-team-a-scrum-master-s-perspective">Surviving (and thriving) Through Peak Season 2016 on the Digital Observability Team</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on May 23, 2017.</p>
<p>Hadoop upgrades over the last few years meant long outages where the Big Data platform team would shut down the cluster, perform the upgrade, start services, and then complete validation before notifying users it was OK to resume activity. This approach is a typical pattern for major upgrades even outside Target, and it reduces the complexity and risks associated with the upgrade. While this worked great for the platform team, it was not ideal for the hundreds of users and thousands of jobs that were dependent on the platform. That is why we decided to shake things up and go all in on rolling maintenance.</p>
<p>Cluster Details:</p>
<ol>
<li>Hadoop (Core) 2.7.1 to 2.7.3</li>
<li>Mixed cluster workload running Hive, Tez, MR, Spark, Pig, HBase, Storm</li>
</ol>
<h2 id="goal">Goal</h2>
<p>Our March 2016 Hadoop upgrade was the turning point for rolling maintenance. With a large outage for the upgrade and monthly maintenance windows leading up to it, we decided to challenge ourselves with rolling maintenance to reach our uptime goals for core components. This would allow the platform team to deploy changes faster, reduce maintenance risk by not bundling changes together every month, and, more importantly, not impact users with planned downtime. Drawing the line in the sand for rolling maintenance meant that it was time to get to work on our upgrade strategy.</p>
<h2 id="reviewing-the-playbook">Reviewing the Playbook</h2>
<p>We started reviewing the upgrade process in November 2016 with the goal of upgrading our first admin cluster in December. The short turnaround time meant that we would leverage our cluster administration tool to handle the upgrade orchestration. The focus shifted towards understanding the order of events and evaluating the process for potential impacts. Running through the upgrades required a quick way to iterate through Hadoop cluster deployments and then tear them down to retest. This is where REDstack, our internal cloud provisioning tool for building out a secure Hadoop cluster with production-like configurations, came into play. After going through the rolling upgrade process a few times, we knew right away there were a few areas we needed to spend time on.</p>
<h2 id="modifying-the-game-plan">Modifying the Game Plan</h2>
<p>The first adjustment we made was to minimize the HiveServer2 (HS2) port change disruption. Having HS2 restart on a new port and run that way for multiple days while the upgrade finished would have caused a major impact for a lot of applications. What we did was modify the stack upgrade XML file so we could pause the upgrade and shut down HS2 instances right before the HS2 binary upgrade and restart. We could then resume the upgrade process to start them back up under the new version on the same ports. This created a second grouping for Apache Hive components in order to bring our HS2 impact down from days to the few minutes needed for a quick restart.</p>
<p>Hive Modifications:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <group name="HIVE" title="Hive">
<skippable>true</skippable>
<service-check>false</service-check>
<supports-auto-skip-failure>false</supports-auto-skip-failure>
<service name="HIVE">
<component>HIVE_METASTORE</component>
<component>WEBHCAT_SERVER</component>
</service>
</group>
<group name="HIVE" title="HiveServer2">
<skippable>true</skippable>
<service-check>true</service-check>
<supports-auto-skip-failure>false</supports-auto-skip-failure>
<service name="HIVE">
<component>HIVE_SERVER</component>
</service>
</group>
</code></pre></div></div>
<h2 id="scouting-the-opponent">Scouting the Opponent</h2>
<p>With confidence increasing, the next item we needed to address was how end-user jobs would react during the upgrade, knowing we had hundreds of servers to roll through. This required us to dig into the distributed cache settings, where we realized we had a couple of options to set the MapReduce classpath:</p>
<ol>
<li>Include an archive that contains the MapReduce, Yarn, HDFS and Hadoop common jars and dependencies</li>
<li>Include an archive that just contains the MapReduce jars and force it to source the remaining dependencies from your local filesystem on each node that has your Hadoop software</li>
</ol>
<p>Going with the first option meant that you would see a reference to ‘$PWD/mr-framework/hadoop’ in your mapreduce.application.classpath. This references a path to the jars, and since the jars are readable by all, the localizer runs inside the address space of the YARN NodeManager to download them, rather than in a separate (private/application-specific) process. To configure this, we modified the following properties:</p>
<ol>
<li>mapreduce.application.framework.path - Path to the MapReduce tar, referenced by the current version that is active on each node</li>
</ol>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/hadoop/apps/${version}/mapreduce/mapreduce.tar.gz#mr-framework
</code></pre></div></div>
<ol>
<li>mapreduce.application.classpath - MR Application classpath which references the framework path above</li>
</ol>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hadoop/${version}/hadoop/lib/hadoop-lzo-0.6.0.${version}.jar:/etc/hadoop/conf/secure
</code></pre></div></div>
<p>After these two settings were in place, they were pushed out with a rolling restart to make sure the mapreduce-site.xml files were updated and ready for testing. During the Core Master phase of the rolling upgrade, the restart step for MapReduce2/HistoryServer2 caused an upload of the latest tarball referenced above. The following output was observed, which shows the HDFS upload:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2016-12-29 16:05:37,076 - Called copy_to_hdfs tarball: mapreduce
2016-12-29 16:05:37,076 - Default version is 2.7.1
2016-12-29 16:05:37,077 - Because this is a Stack Upgrade, will use version 2.7.3
2016-12-29 16:05:37,077 - Source file: /hadoop/2.7.3/mapreduce.tar.gz , Dest file in HDFS: /hadoop/2.7.3/mapreduce/mapreduce.tar.gz
</code></pre></div></div>
<p>To make sure our classpath was set up appropriately, we started a test job to validate it. From the ResourceManager UI, we searched the AppMaster logs for ‘org.mortbay.log: Extract jar’. This resulted in an entry similar to the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2016-12-29 17:07:17,216 INFO [main] org.mortbay.log: Extract jar:file:/grid/0/hadoop/yarn/local/filecache/10/mapreduce.tar.gz/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.7.1.jar!/webapps/mapreduce to /grid/0/hadoop/yarn/local/usercache/matt/appcache/application_1482271499284_0001/container_e06_1482271499284_0001_01_000001/tmp/Jetty_0_0_0_0_46279_mapreduce____.jzpsrd/webapp
</code></pre></div></div>
<p>During the rolling upgrade, our admin tool tracked the active version based on the node upgrades. From a timing standpoint, it held the old version until the last core service was upgraded (which happened to be Apache Oozie) and then ran the client refresh. After that, the active version became 2.7.3, which allowed newly submitted jobs to run under the new version. All existing jobs ran to completion using 2.7.1.</p>
<h2 id="final-practice">Final Practice</h2>
<p>With our prep work completed and dozens of cloud cluster iterations for practice, we were ready to run through our admin cluster. The admin cluster upgrade went well but hit a stumbling point with the Hive metastore during the restart. After identifying the issue, we realized we needed a fast way to compare configurations across our clusters to proactively identify differences that might cause more problems. We threw together a quick Python script to help parse configurations from our admin tool’s API, compare values between clusters, and write out configurations that did not match.</p>
<p>Example API Code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Example excerpt: compare desired configurations between clusters via the admin tool's REST API.
# Connection settings (source_cluster_host, source_cluster_port, source_cluster,
# source_user_name, source_user_passwd) are defined elsewhere in the script.
import json
import sys

import requests
from requests.auth import HTTPBasicAuth

# Get current desired configurations from source cluster
def get_source_configurations():
    print "Gathering configuration files for source cluster..."
    try:
        source_cluster_URL = "http://" + source_cluster_host + ":" + source_cluster_port + "/api/v1/clusters/" \
                             + source_cluster + "?fields=Clusters/desired_configs"
        response = requests.get(source_cluster_URL, auth=HTTPBasicAuth(source_user_name, source_user_passwd))
        response.raise_for_status()
        return json.loads(response.text)
    except requests.exceptions.HTTPError as err:
        print err
        sys.exit(1)

# Create dictionary of config file names/versions for source cluster
def get_source_version_details(json_data):
    print "Gathering version details for configuration files for source cluster..."
    source_desired_confs = {}
    for config_file in json_data['Clusters']['desired_configs']:
        source_desired_confs[config_file] = json_data['Clusters']['desired_configs'][config_file]['tag']
    return source_desired_confs
</code></pre></div></div>
<p>After running the configuration checks between active and new cloud clusters we identified a handful of properties that needed to be updated to help us avoid issues with the remaining clusters. With the changes in place we were ready for the stage cluster.</p>
<p>With dozens of hosts, this would be our first true test of the new upgrade process with end-user jobs. The stage cluster upgrade avoided the issues we saw on the admin cluster and resulted in hundreds of jobs running successfully during the six-hour upgrade window. Even though the process went well, we still noticed an area of opportunity to help reduce Apache Storm downtime. Stopping all running topologies resulted in an additional 10-15 minute delay because we relied on manually killing them. To get Storm downtime under our goal of 10 minutes (a full restart was required due to packaging updates), we leveraged the Storm API to identify running topologies and then kill them.</p>
<p>Example Storm API Code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Example excerpt: list running topologies via the Storm UI REST API.
# storm_ui_host and storm_ui_port are defined elsewhere in the script.
import json
import sys

import requests

# Get topology list
def get_topologies():
    print "Gathering running topologies..."
    try:
        storm_url = "http://" + storm_ui_host + ":" + storm_ui_port + "/api/v1/topology/summary"
        response = requests.get(storm_url)
        response.raise_for_status()
        return json.loads(response.text)
    except requests.exceptions.HTTPError as err:
        print err
        sys.exit(1)
</code></pre></div></div>
<p>Now that we could identify and kill topologies within a minute or so, we knew we were ready for the big game.</p>
<h2 id="slow-start-to-the-first-half">Slow Start to the First Half</h2>
<p>The prior two cluster upgrades were good tests, but did not come close to the scale of our production cluster. With hundreds of nodes, tens of PB of raw HDFS storage, and complex user jobs across multiple components, we knew we were in for a challenge. Based on the admin and stage cluster upgrades, we put together estimates that had us completing the upgrade in around 44 hours. With the team assembled in a war room we were ready to start, or so we thought. The size of the upgrade caused issues with our admin tool generating the upgrade plan, which resulted in a five-hour delay while we recovered the database and fixed the initial glitch. By the time we got into the rolling restarts for slave components (HDFS DataNodes, YARN NodeManagers, HBase RegionServers), we were already behind. After watching the first handful of sequential restarts, we quickly realized our goal of 44 hours was looking a lot more like 143 hours, putting us days behind. Outside of the delays, we also ran into a bug with DataNodes failing to restart. With a new DataNode service restarting every five minutes, the risk of data loss if multiple nodes were down at once was the primary concern going into the night. To reduce the risk and help proactively restart nodes, a quick bash script was put together and scheduled in cron to catch DataNodes that had failed.</p>
<p>Example Bash Script:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>check_wait_restart() {
echo "Checking Datanode..."
PID=$(pgrep -u hdfs -f DataNode | head -1)
if [ ! -z "${PID}" ]; then
echo "Datanode is running."
return 0
fi
echo "Datanode is stopped, sleeping..."
sleep 120
echo "Checking Datanode..."
PID=$(pgrep -u hdfs -f DataNode | head -1)
if [ ! -z "${PID}" ]; then
echo "Datanode is running."
return 0
fi
echo "Datanode is still down after 2 minutes, restarting..."
export HADOOP_LIBEXEC_DIR=/usr/hadoop/2.7.3/hadoop/libexec
/usr/hadoop/2.7.3/hadoop/sbin/hadoop-daemon.sh --config /usr/hadoop/2.7.3/hadoop/conf start datanode
}
</code></pre></div></div>
<h2 id="halftime-motivation">Halftime Motivation</h2>
<p>Running days behind, and with HDFS storage continuing to grow (a 10% jump in the first day), we knew we needed to speed things up. We identified two paths forward: the first relied on a patch from our vendor’s engineering team (a query optimization) to get us through the slave restarts faster, and the second looked at Chef to help complete the restarts. Work started on both immediately, with the Target team jumping into the Chef work and knocking out templates for services and adding recipes to duplicate the admin tool logic. After a couple of hours, news came that an admin tool hotfix was ready, and the end result dropped our slave restart time by 75%, putting us back on the original forecasts (60 seconds per node). Focus shifted from the slave restarts over to the client refreshes, which were also sequential tasks by node. We determined that we would manage the client refreshes outside the admin tool so we could complete them within hours and get back on schedule.</p>
<h2 id="second-half-comeback">Second Half Comeback</h2>
<p>Being days behind with the client refreshes coming up meant taking a risk to get back into the game. Shell scripts were developed to handle the logic of checking and creating configuration directories, gathering and deploying service files, and updating the active versions to cut over to the new version.</p>
<p>Example Bash Script:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># verify that conf dirs are in correct places
for svc in hadoop zookeeper tez sqoop pig oozie hive hive-hcatalog hbase ; do
# Creating Conf Directories
if [ ! -d "/etc/${svc}/${STACK_VERSION}/${CONF_VERSION}" ]; then
echo "Creating ${STACK_VERSION} conf dir /etc/${svc}/${STACK_VERSION}/${CONF_VERSION}..."
/usr/bin/conf-select create-conf-dir --package "${svc}" --stack-version ${STACK_VERSION} --conf-version ${CONF_VERSION} || :
fi
/usr/bin/conf-select set-conf-dir --package "${svc}" --stack-version ${STACK_VERSION} --conf-version ${CONF_VERSION} || :
done
</code></pre></div></div>
<p>After some quick sanity checks in our lower environment, we determined that we were ready. The admin tool was paused, and iterations of deploying client configurations were started. The workaround went great, and client refresh times were reduced by 95%, down to four hours from the updated estimate of 67 hours. Finalizing the upgrade afterwards was the whistle ending the marathon event.</p>
<h1 id="instant-classic">Instant Classic</h1>
<p>Completing the largest single production cluster rolling upgrade in the history of one of the largest Hadoop vendors, with 99.99% uptime, was a great team accomplishment (internal and external). The months of preparation that went into the upgrade and the game-time decisions throughout paid off. Our final duration clocked in at just over 41.5 hours, hours ahead of our goal time and days ahead of the modified duration if no action had been taken. During this time, we had thousands of jobs running, supporting multiple key initiatives and influencing business decisions. The engineering talent on the team was the ultimate reason for our success. The right mix of skills and personalities allowed us to react throughout the entire event. The other important lesson learned, mentioned above but deserving more focus, is that a rolling upgrade takes time. While HDFS is being upgraded, deleted blocks of data are not actually removed until the upgrade is completed and HDFS is finalized. Make sure you have sufficient HDFS storage to avoid the risk of running out of space during your upgrade. With our major upgrade out of the way, we are excited to focus on the next chapter and continue to raise the bar.</p>
<h1 id="about-the-author">About the Author</h1>
<p>Matthew Sharp is a Principal Data Engineer on the Big Data team at Target. He has been focused on Hadoop and the Big Data ecosystem for the last 5+ years. Target’s Big Data team has full stack ownership of multiple open source technologies. The team manages hundreds of servers with tens of PBs of data, using an advanced CI/CD pipeline to automate every change. Stay tuned for future team updates to share how we do Big Data at Target!</p>
<p><a href="https://target.github.io/infrastructure/hadoop-rolling-upgrades">Hadoop Rolling Upgrades</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on May 22, 2017.</p>
//target.github.io/how-and-why-we-moved-to-spinnaker
https://target.github.io/how-and-why-we-moved-to-spinnaker2017-04-07T00:00:00-00:002017-04-07T00:00:00-05:00Target Brands, Inchttps://target.github.io<h4 id="background">Background</h4>
<p>Just after the middle of last year, Target expanded beyond its on-prem infrastructure and began deploying portions of target.com to the cloud. The deployment platform was homegrown (codename <code class="highlighter-rouge">Houston</code>) and was backed wholly by our public cloud provider. While in some aspects that platform was on par with other prominent continuous deployment offerings, the actual method of deploying code was cumbersome and did not adhere to cloud best practices. These shortcomings led to a brief internal evaluation of various CI/CD platforms, which in turn led us to Spinnaker.</p>
<p>We chose Spinnaker because it integrates with CI tools we already use at scale (Jenkins), supports deploying to all major public cloud providers, and enforces deployment best practices: all deployments are performed via immutable images, a snapshot of config + code.</p>
<h4 id="supporting-a-platform">Supporting a Platform</h4>
<p>The primary goal of Target’s cloud platform is to enable product teams to deploy and manage their applications across multiple cloud providers. We provide CI/CD, monitoring, and service discovery as services, and any application deployed via our platform gets those capabilities via a base image that is pre-configured for connectivity to each service’s respective endpoint.</p>
<p>Since these components are essentially products we provide to internal customers, we had to ensure the new CD platform was operationally supportable and highly-available. So, as soon as we decided on Spinnaker, a handful of engineers from the Cloud Platform group set about making this happen.</p>
<p>Default Spinnaker scripts make it easy to stand up a single self-contained server with the microservices and persistence layer all together, but that wasn’t conducive to blue-green deployments, which allow updates of Spinnaker without downtime for our internal customers.</p>
<p>We built jobs to build packages from the master branch of each Spinnaker component’s upstream git repository, and wrote Terraform plans to manage the deployment of each stack. We set about making each component as resilient as possible. Front50, the Spinnaker component responsible for data persistence, uses Cassandra by default. Managing a Cassandra ring added too much overhead just to maintain Spinnaker’s configuration, so we borrowed a play from Netflix and configured Front50 to persist configuration in cloud storage instead. We’re also using the cloud provider’s cache option instead of managing our own highly-available Redis cluster.</p>
<h4 id="overcoming-challenges">Overcoming challenges</h4>
<p>We had our fair share of challenges along the way. We were running into various rate-limiting and throttling issues when hitting the cloud provider APIs. Spinnaker makes a lot of API calls in order to have a responsive user experience, but the cloud provider will only allow a small number of API calls per minute. Fortunately, Netflix also wrote a tool, Edda, to cache cloud resources. We configured Spinnaker to use that instead of a direct connection to the cloud provider’s API, which seems to be scaling much better.</p>
<p>The other major challenge we faced was how to handle baking images for multiple regions. Originally we had configured Spinnaker to bake in our western region and then copy to the east. That was SLOW. It would bake the image in about 5 minutes, but it would take 20-40 minutes to copy it. Fortunately, Spinnaker supports parallel multi-region baking – taking the same base OS image in each region and installing the same packages on it, which in theory should result in the same config+code. Unfortunately, the way Netflix implemented it would work if we had multiple regions in the same account, but not if we had separate accounts per region, which is how Target operates. One of our engineers found a workaround, and ultimately worked with an engineer at Netflix to get a more elegant solution incorporated upstream. Now, because we can bake the images in parallel, baking takes 5 minutes instead of 25-45 minutes. A big improvement.</p>
<h4 id="openstack-driver">OpenStack Driver</h4>
<p>One of the reasons we chose Spinnaker was its pluggable architecture, and we knew early on that we would put that to good use. Target has a considerably sized internal OpenStack environment and we wanted to be able to leverage Spinnaker to deploy there, so we started the development of a native OpenStack clouddriver.</p>
<p>The process worked exactly how contributing to an open source project should work. We asked the community if anyone else was interested in collaborating, and Veritas Corporation, who also needed OpenStack support in Spinnaker, answered the call; we set to work with several of their engineers. We met with core Spinnaker engineers from Netflix and Google, and they asked us to submit pull requests directly against the master branch in the public github repository in order to get rapid feedback on the changes we were making.</p>
<p>A small group of engineers began speccing out the work and started development in late May, and at the end of September the driver reached what we would call a <code class="highlighter-rouge">stable</code> state.</p>
<h4 id="autoscaling-in-openstack">Autoscaling in OpenStack</h4>
<p>During development of the OpenStack driver, we ran into an issue with the way autoscaling is implemented in Heat (an OpenStack orchestration engine to launch multiple composite cloud applications based on templates in the form of text files that can be treated like code).</p>
<p>First, the APIs for load balancers didn’t support automatically adding instances to a member pool until the Mitaka release of OpenStack. We were running an earlier version, so our private cloud engineers swarmed and quickly upgraded our environment to Mitaka.</p>
<p>Second, and more troublesome, Heat doesn’t track disparity between the desired instance count in a scaling group and the actual count. As a result, autoscaling in OpenStack isn’t really <em>auto</em>; it requires some kind of intervention. To work around this, we updated the driver to mark the server group <em>unhealthy</em> in Heat when it detects a discrepancy between desired and actual.</p>
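<p>The driver does this through the Heat API, but the idea can be illustrated with the OpenStack CLI. The sketch below is purely illustrative: the stack name, resource name, and desired count are hypothetical, and the desired capacity is supplied by hand rather than read from the server group definition.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative sketch only: the clouddriver talks to the Heat API directly.
# Stack name, resource name, and desired count below are hypothetical.
STACK="myapp-v003"              # Heat stack backing the server group
GROUP_RESOURCE="server_group"   # the OS::Heat::AutoScalingGroup resource
desired=3                       # desired capacity from the server group definition

# Count the instances Heat actually knows about
actual=$(openstack stack resource list "$STACK" --nested-depth 2 -f value -c resource_type | grep -c 'OS::Nova::Server')

if [ "$actual" -lt "$desired" ]; then
  # Flag the group so the next stack update converges back to the desired count
  openstack stack resource mark unhealthy "$STACK" "$GROUP_RESOURCE" "instance count drift"
  openstack stack update --existing "$STACK"
fi
</code></pre></div></div>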
<h4 id="whats-next">What’s next?</h4>
<p>We’re extremely proud to share the OpenStack driver in Spinnaker with the community, and we hope that any organization that is using OpenStack will leverage the new driver to enable immutable deployments and autoscaling in that environment.</p>
<p>We’re currently growing our use of container-based deployments via Kubernetes, both internally and on public cloud providers. The presence of a driver for Kubernetes in Spinnaker will enable us to facilitate deployments to any or all of the k8s clusters we’re running. Look forward to learning more about our Kubernetes deployment and consumption in a future techblog post.</p>
<p><a href="https://target.github.io/how-and-why-we-moved-to-spinnaker">How (and Why) We Moved to Spinnaker</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on April 07, 2017.</p>
//target.github.io/infrastructure/distributed-troubleshooting
https://target.github.io/infrastructure/distributed-troubleshooting2017-04-05T00:00:00-00:002017-04-05T00:00:00-05:00Target Brands, Inchttps://target.github.io<p><img src="//target.github.io/images/distributed-troubleshooting-header.png" alt="Distributed Troubleshooting" /></p>
<p>Target’s open source big data platform contains a vast array of clustered technologies, or ecosystems, working together. Troubleshooting an issue within a single ecosystem is a difficult task, let alone an issue that spans several ecosystems.</p>
<p>It is impractical for a single human to individually investigate ecosystems one at a time for potential problems. Without quick access to aggregated system metrics and logs, the house will burn to the ground long before an engineer can find the cause of an issue and resolve it.</p>
<h3 id="the-solution">The Solution</h3>
<p>How do you identify, troubleshoot, and resolve a distributed issue? Fight fire with fire, of course! Big data issues must be solved with big data solutions.</p>
<p>At Target, we are constantly expanding our Distributed Troubleshooting Platform to encapsulate every log and metric from every service in every ecosystem of our big data platform. Aggregating this data into a single troubleshooting platform enables an engineer to view error logs and system metrics across hundreds of machines and services with a single click.</p>
<p>A troubleshooting platform like the one described above is not a new idea. Systems like Splunk have been doing it for years. Splunk, however, has restrictions on the amount of data that can be ingested without an enterprise license. The larger we scale, the more money we pay for systems like Splunk.</p>
<p>We created our Distributed Troubleshooting Platform from open-source components and without enterprise licenses. This allows us to use it on every server in the big data platform without worrying about the volume of data it processes or renegotiating enterprise licenses. It becomes a given, not a variable.</p>
<p>Our Distributed Troubleshooting Platform is similar to the black box recorder on an aircraft. A majority of the time, the contents are never viewed. When the plane crashes however, the contents of the black box are the only way to reconstruct what happened and learn from the incident.</p>
<p>Running a big data platform without enterprise licenses comes at a cost. There is no enterprise support or “throat to choke” if something goes wrong. Having all the system metrics and logs available as quickly as possible becomes crucial to resolving the issue quickly.</p>
<h3 id="the-distributed-troubleshooting-platform-at-target">The Distributed Troubleshooting Platform at Target</h3>
<p>We chose the Elasticsearch, Logstash and Kibana stack (what was the ELK Stack is now the Elastic Stack). The Elastic Stack out of the box does not do a whole lot, but what it provides is a bottomless toolbox of components to build anything imaginable.</p>
<p>For those not familiar with the Elastic Stack, here is a brief description of the key components:</p>
<ul>
<li>Elasticsearch – A pluggable horizontally scalable distributed search engine based on Lucene.</li>
<li>Logstash – A pluggable ingestion, transformation and publishing tool.</li>
<li>Kibana – A pluggable Node.js web server providing discovery and visualization dashboards.</li>
</ul>
<p>The keyword in all of the above is pluggable. The Elastic Stack components are great out of the box but when augmented with various home brewed plugins and the incredible availability of community plugins, they become unstoppable.</p>
<h3 id="requirements">Requirements</h3>
<p>The requirements for our Distributed Troubleshooting Platform:</p>
<ul>
<li>Enterprise license-free</li>
<li>Able to ingest and visualize any log of any type</li>
<li>Able to ingest and visualize any metric of any type</li>
<li>Deployment must be automated</li>
<li>Highly-available and horizontally scalable</li>
<li>Secure, end-to-end SSL and authorization (Search Guard, an open-source, pluggable security suite for Elasticsearch that does not require an enterprise license)</li>
</ul>
<h3 id="enough-talk-time-for-a-real-world-example">Enough talk, time for a real world example</h3>
<p>The problem: our Distributed Troubleshooting Platform worked great up until about 600 servers were sending data to it. Amongst these servers were Hadoop services, MySQL, Storm, Elasticsearch, Chef and a whole host of other services.</p>
<p>The Elasticsearch cluster handling the events was dropping more and more events with every new node added to the big data platform. At first this was not concerning, because a majority of the events were still flowing, enough to troubleshoot and fight fires.</p>
<p>We started hooking up Nagios to our Distributed Troubleshooting Platform to monitor service metrics and immediately noticed a problem. The data quality of the service metrics was poor. Although they were still good enough to look at from a one-hour view in Kibana, they were not sufficient for notifications. There were random gaps in the data, sometimes as large as 10 minutes. These gaps would trigger false alarms from Nagios.</p>
<h3 id="getting-to-the-bottom-of-the-issue">Getting to the bottom of the Issue</h3>
<p>The natural place to start is at the Logstash level. It seemed as if Logstash was not configured correctly; otherwise the events would not be dropping. Numerous iterations of Logstash configuration changes yielded little to no improvement in dropped events, and the issue was growing worse by the day. Without any metrics on the Elasticsearch cluster, the only scientific validation of the Logstash changes would be whether or not the events were still dropping.</p>
<p>Not very scientific, but being an engineer, I had 100 other things to do besides instrument Elasticsearch, which up until this point had never given me a problem.</p>
<p>After countless failed attempts to figure out the issue with Logstash, I decided it was time to instrument Elasticsearch. I needed better instrumentation to see what the true impact of each change was on the system. Guessing was not working.</p>
<h3 id="instrumenting-elasticsearch">Instrumenting Elasticsearch</h3>
<p>I wanted to know absolutely every metric for Elasticsearch if I was going to solve this once and for all. The following endpoints provided what I needed:</p>
<ul>
<li>https://localhost:9200/_nodes/stats</li>
<li>https://localhost:9200/_cluster/stats</li>
<li>https://localhost:9200/_cluster/health</li>
</ul>
<p>Here is a small example snippet of some data I was interested in from one of the above endpoints:
<img src="//target.github.io/images/distributed-troubleshooting-elastic-jvm-stats.png" alt="Elastic JVM Stats" /></p>
<p>Those endpoints contain parent-child relationships, similar to the JVM metrics above. Kibana does not like parent-child relationships, so I cannot ingest the JSON natively. Those endpoints also contain 100+ other metrics I was not interested in. Ingesting the entire endpoint JSON natively, without pruning the contents, would introduce a monsoon of fields into the Distributed Troubleshooting Platform.</p>
<p>I needed to pull out the bits and pieces I was interested in, flatten out the JSON and combine the three endpoints into an uber JSON document. This provides a single document containing a snapshot of all important metrics for a given moment.</p>
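<p>As a rough sketch of that approach (not the exact script we run), assuming jq is available and basic-auth credentials are exported in ES_USER and ES_PASS, the three endpoints can be sampled and squashed into one flat document like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Rough sketch: sample the three stats endpoints, keep only a handful of fields,
# and emit a single flat JSON document per run for Logstash to pick up.
ES="https://localhost:9200"
AUTH="${ES_USER}:${ES_PASS}"    # Search Guard basic auth (assumed environment variables)

health=$(curl -sk -u "$AUTH" "$ES/_cluster/health")
cstats=$(curl -sk -u "$AUTH" "$ES/_cluster/stats")
nstats=$(curl -sk -u "$AUTH" "$ES/_nodes/_local/stats")

# Flatten the parent-child JSON into top-level fields that Kibana can work with.
jq -n --argjson h "$health" --argjson c "$cstats" --argjson n "$nstats" '
  ($n.nodes | to_entries[0].value) as $node |
  { timestamp: (now | todate),
    cluster_status: $h.status,
    cluster_active_shards: $h.active_shards,
    cluster_docs_count: $c.indices.docs.count,
    node_name: $node.name,
    jvm_heap_used_percent: $node.jvm.mem.heap_used_percent,
    thread_pool_bulk_queue: $node.thread_pool.bulk.queue,
    http_current_open: $node.http.current_open }
' >> /var/log/es-metrics/elastic-metrics.json   # output path is a placeholder
</code></pre></div></div>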
<p>Here is the same data (only the bits I cared about) from above, flattened out into something Kibana can work with:
<img src="//target.github.io/images/distributed-troubleshooting-elastic-jvm-stats-flattened.png" alt="Elastic JVM Stats" /></p>
<p>Since I was emitting JSON documents, which my friend Logstash loves, ingesting them to Elasticsearch was trivial:
<img src="//target.github.io/images/distributed-troubleshooting-logstash-input.png" alt="Elastic JVM Stats" />
<img src="//target.github.io/images/distributed-troubleshooting-logstash-filter.png" alt="Elastic JVM Stats" /></p>
<h3 id="the-elastic-metrics-dashboard">The elastic-metrics Dashboard</h3>
<p><img src="//target.github.io/images/distributed-troubleshooting-kibana-elastic-metrics-1.png" alt="Elastic JVM Stats" />
<img src="//target.github.io/images/distributed-troubleshooting-kibana-elastic-metrics-2.png" alt="Elastic JVM Stats" />
<img src="//target.github.io/images/distributed-troubleshooting-kibana-elastic-metrics-3.png" alt="Elastic JVM Stats" /></p>
<p>Almost immediately I saw what I thought was the problem. I looked at the “elastic-thread_pool_bulk-host” visualization and saw that the bulk queue for one of my Elasticsearch servers was 2,500 bulk events deep while the rest were hardly at 20. Like any engineer, I restarted that instance and thought, “yeah, there we go, must have been a glitch with that instance.” The restart of that server can be seen as the bright pink line that drops off around 8:00am in various charts above.</p>
<p>To my great disappointment, I had made the problem even worse. The aggregate bulk queue length skyrocketed.</p>
<p><strong>The bulk queue being backed up is a symptom, not the problem.</strong></p>
<p>Back to the elastic-metrics dashboard to look for anything else that did not make sense. I turned my attention to the “elastic-http_current_open” visualization and noticed something odd: I should not have had 2,500 HTTP connections to the Elasticsearch cluster. We had, at that time, about 600 nodes using the Distributed Troubleshooting Platform, so I should have had about 600 HTTP connections.</p>
<p>It turns out that one of my premature optimization choices, setting the number of Elasticsearch output workers in each Logstash agent to 4, was quadrupling the number of HTTP sessions coming from every node. Instead of one large bulk request with up to X events, each agent was sending 4 smaller bulk requests of X/4 events.</p>
<p>I quickly pushed out a change to every server in every ecosystem, without a single failure, with my good friend Chef. The change adjusts the output workers to 1 and bumps the batch size from 1000 to 5000. I get three fewer workers (three fewer HTTP connections) and 1000+ max events per agent with this new configuration. Excellent.</p>
<p>Almost immediately, around 10:00am, I noticed the HTTP session count dropping and ingest starting to pick up. Dropped events happened less and less, and the bulk queue went down. Time to get some beers.</p>
<h3 id="scaling-a-distributed-system-is-hard-and-full-of-mistakes">Scaling a distributed system is hard and full of mistakes</h3>
<p>Way back when we had just a handful of servers, I figured, why not have more than one Elasticsearch output worker on my Logstash agents? I want MORE POWER all the time! This scaled great, until it did not.</p>
<p>Every node added to the Distributed Troubleshooting Platform was as if we added four clients to Elasticsearch.</p>
<p>I successfully used the Distributed Troubleshooting Platform to troubleshoot why the Distributed Troubleshooting Platform was not working.</p>
<p>Log aggregation and service metrics in an open-source big data platform are invaluable when running without enterprise licenses and enterprise support contracts. Engineers need a platform that is quickly augmentable, expandable and configurable to stay on top of all the various technologies and their quirks. There is no enterprise support to call when something goes wrong. The big data engineers own the platform and are held responsible for it when something malfunctions. Creating a ticket for enterprise support and making it someone else’s problem is not an option.</p>
<p>We solve issues in our ecosystems like the one described here on a daily basis thanks to the Elastic Stack and a bit of elbow grease from our engineers. Our Distributed Troubleshooting Platform currently has ~250 service metrics and ~40 different log types being ingested at ~1TB per week un-replicated.</p>
<p><a href="https://target.github.io/infrastructure/distributed-troubleshooting">Distributed Troubleshooting</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on April 05, 2017.</p>
//target.github.io/infrastructure/Win_the_cloud_with_Winnaker
https://target.github.io/infrastructure/Win_the_cloud_with_Winnaker2017-02-13T00:00:00-06:002017-02-13T00:00:00-06:00Target Brands, Inchttps://target.github.io<h1 id="win-the-cloud-with-winnaker">Win the cloud with Winnaker!</h1>
<p>I am happy to announce that we at Target have decided to open source a tool called Winnaker. This tool allows the user to audit Spinnaker from an end user’s point of view.</p>
<h2 id="but-first-what-is-spinnaker">But first what is Spinnaker?</h2>
<p>The first time I heard the word Spinnaker, my reaction was, “wait, what does that even mean in English?”</p>
<p>Shortly after, I found myself implementing a demo of Spinnaker as a potential replacement for our internal cloud deployment tool.</p>
<p>Spinnaker is a cloud agnostic continuous delivery tool, which means we can push our code to any cloud provider we like. In fact, Spinnaker takes agnosticism to the next level by introducing three abstractions.</p>
<ol>
<li>Load balancers</li>
<li>Server groups</li>
<li>Security groups</li>
</ol>
<p>By enforcing this level of simplicity, Spinnaker allows deployment strategies such as Highlander and Red/Black to be implemented on vastly different types of infrastructure (VMs, containers, Kubernetes, public cloud, private cloud) with a high level of confidence and an incredible level of ease of use for app developers.</p>
<p>Spinnaker also champions the immutable infrastructure design pattern. Baking your image once and deploying that image everywhere is another bold move that differentiates Spinnaker from other tools.</p>
<h2 id="why-winnaker-">Why Winnaker ?</h2>
<p>The short answer is automation!</p>
<p><img src="//target.github.io/images/winnaker_passed.png" alt="" /></p>
<h3 id="test-the-functionality-of-the-cd-system-as-a-whole">Test the functionality of the CD system as a whole.</h3>
<p>Spinnaker has different components (CloudDriver, Rosco, Deck, …).
Each of these components has its own unit tests and health checks that can be monitored.</p>
<p>We learned the hard way that relying only on component health checks is not effective enough to ensure developers won’t face any error when they deploy their apps.</p>
<p>A few things can go wrong while staying off the monitoring radar:</p>
<ul>
<li>Connectivity between the separate components</li>
<li>Maxing out cloud provider API rate limits</li>
<li>Base infrastructure configurations (subnet address space, identity management roles)</li>
</ul>
<p>So we decided to audit Spinnaker’s and the cloud’s whole functionality with a sample app. If baking and deploying our sample app in different accounts and regions passes, then we are positive that everything works!</p>
<p>However, that kind of testing is time-consuming and boring for humans. Winnaker brings the fun back to testing.</p>
<p>Winnaker is the product of automating the auditing of your deployment process.</p>
<h3 id="automate-troubleshooting">Automate Troubleshooting</h3>
<p>Every error in Spinnaker means something new that we document, but who reads the documentation? Additionally, documentation goes out of date all the time.</p>
<p>Winnaker has a list of known error messages and it comes up with suggestions that you may want to use based on the error message.</p>
<p>For instance, this is an example of a Winnaker output:</p>
<p><img src="//target.github.io/images/winnaker_troubleshooting_suggestion.png" alt="" /></p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- Failed for : This application has no explicit mapping for /error
- Suggestion : Check Deck
</code></pre></div></div>
<p>And you can add your own suggestions for errors.</p>
<h3 id="what-are-the-features-of-winnaker-">What are the features of Winnaker ?</h3>
<ul>
<li>Start a pipeline on Spinnaker with different options (force baking, deploy,…)</li>
<li>Get stage details and return a non-zero exit code on failure</li>
<li>Screenshot the stages</li>
<li>Pressure test your cloud deployment</li>
<li>Integrate with HipChat</li>
<li>Offer troubleshooting suggestions</li>
<li>Work with different cloud providers</li>
</ul>
<h3 id="how-do-you-install-winnaker-">How do you install Winnaker ?</h3>
<p>There is nothing to install. Everything ships in a Docker container. Winnaker uses chromedriver, python, virtualdisplay and selenium; installing any of those separately can sometimes be a recipe for headaches.</p>
<h3 id="what-do-you-need-">What do you need ?</h3>
<ul>
<li>A Spinnaker URL</li>
<li>A sample app and sample pipeline to run</li>
</ul>
<p>More extensive documentation is located in Winnaker’s GitHub repository. Please feel free to open issues or submit PRs.</p>
<hr />
<h4 id="about-the-author">About the Author</h4>
<p>Medya Ghazizadeh is a Senior Engineer and part of Target’s Cloud Platform Engineering team.</p>
<p><a href="https://target.github.io/infrastructure/Win_the_cloud_with_Winnaker">Win the cloud with Winnaker!</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on February 13, 2017.</p>
//target.github.io/data%20science%20and%20engineering/numspark
https://target.github.io/data%20science%20and%20engineering/numspark2016-09-29T00:00:00-00:002016-09-29T00:00:00-05:00Target Brands, Inchttps://target.github.io<p>At Target we aim to make shopping more fun and relevant for our guests through extensive use of data – and believe me, we have lots of data! Tens of millions of guests and hundreds of thousands of items lead to billions of transactions and interactions. We regularly employ a number of different machine learning techniques on such large datasets for dozens of algorithms. We are constantly looking for ways to improve speed and relevance of our algorithms and one such quest brought us to carefully evaluate matrix multiplications at scale – since that forms the bedrock for most algorithms. If we make matrix multiplication more efficient, we can speed up most of our algorithms!</p>
<p>Before we dig in, let me describe some properties of the landscape we will be working in. First, what do I mean by <em>large scale</em>? A <em>large scale</em> application, at a minimum, will require its computation to be spread over multiple nodes of a distributed computing environment to finish in a reasonable amount of time. These calculations will use existing data that are stored on a distributed file system that provides high-throughput access from the computing environment. Scalability, in terms of storage and compute, should grow as we add to these resources. As the system grows larger and more complex, failures will become more commonplace. Thus, software should be fault-tolerant.</p>
<p>Fortunately, there is a lot of existing open-source software that we can leverage to work in such an environment, particularly <a href="https://hadoop.apache.org">Apache Hadoop</a> for storing and interacting with our data, <a href="http://spark.apache.org">Apache Spark</a> as the compute engine, and both Apache Spark and <a href="http://mahout.apache.org">Apache Mahout</a> for applying and building distributed machine learning algorithms. There are many other tools that we can add to the mix as well, but for the purposes of this post we will limit our discussion to these three.</p>
<p>With that out of the way, let’s dig in!</p>
<h3 id="dont-forget-the-basics">Don’t Forget the Basics</h3>
<p>Begin with good old paper and pencil. Yeah, I know this is about large scale matrix operations that you could never do by hand, but starting with a few examples that you can work out with pencil and paper is indispensable.</p>
<ul>
<li>Make up a few matrices with different shapes - maybe one square and nonsymmetric, one square and symmetric, and one not square.</li>
<li>Choose the dimensions such that you can perform matrix multiplication with some combination of these matrices.</li>
<li>Work out the results of different operations, such as the product of two matrices, the inverse of a matrix, the transpose, etc.</li>
</ul>
<p>There is no need to go overboard with this. It can all fit on a single piece of paper, front and back maximum.</p>
<h3 id="test-the-basics">Test the Basics</h3>
<p>Now it’s time to start making use of Spark and Mahout, and the best place to begin is by implementing the examples you just worked out by hand. This has two obvious purposes:</p>
<ol>
<li>You can easily verify that you are using the packages correctly, i.e., your code compiles and runs.</li>
<li>You can easily verify that what you have calculated at runtime is correct.</li>
</ol>
<p>What is nice about these two simple things is that a lot of important concepts are unearthed just by getting this far: particularly, how you take your initial data and transform it into a matrix type, and what operations are supported.</p>
<p>What you will find is that there are 3 basic matrix types (in terms of distribution); the local matrix (<a href="http://spark.apache.org/docs/latest/mllib-data-types.html#local-matrix">Spark</a>, <a href="https://mahout.apache.org/users/environment/in-core-reference.html">Mahout</a>), the row matrix (<a href="http://spark.apache.org/docs/latest/mllib-data-types.html#rowmatrix">Spark</a>, <a href="https://mahout.apache.org/users/environment/out-of-core-reference.html">Mahout</a>), and the block matrix (<a href="http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix">Spark</a>). Each of these types differ in how they are initialized and how the matrix elements are distributed. Visually this is depicted for 3 different <code class="highlighter-rouge">m x n</code> matrices below:</p>
<p><img src="../images/distributed-matrices.png" alt="" /></p>
<p>In the above figure you can think of each block as representing a portion of the matrix that is assigned to a partition. Thus, a local matrix is stored entirely on a single partition. A row matrix has its rows distributed among many partitions. And a block matrix has its submatrices distributed among many partitions. These partitions can in turn be distributed among the available executors.</p>
<p>What’s the takeaway? Applications that rely solely on operations on local matrices will not benefit from additional executors, and will be limited to matrices that can fit in memory. Applications that use row matrices will benefit from additional executors, but as <code class="highlighter-rouge">n</code> gets larger and larger will suffer from the same limitations as local matrices. Applications that rely solely on block matrices can scale to increasingly large values of <code class="highlighter-rouge">m</code> and <code class="highlighter-rouge">n</code>.</p>
<p>The caveat is that underlying algorithms increase in complexity as you move from local to row to block matrices. As such, there may not be an existing pure block matrix implementation of the algorithm you wish to use, or the algorithm may feature a step that multiplies a row matrix by a local matrix. Understanding the implementation details from this perspective can help you to anticipate where the bottlenecks will be as you attempt to scale out.</p>
<h3 id="from-basics-to-the-big-leagues">From Basics to the Big Leagues</h3>
<p>At this stage we aim to ramp up from operations on small matrices to operations on larger and larger matrices. To do this, we begin by sampling our real data down to a small size. Using this sample data we can benchmark the performance of the matrix operations we are interested in. Then, we move to a larger sample size and repeat the benchmarking process. In this way, we can ramp up to the problem size of interest, while building and developing intuition about the resources required at each new scale.</p>
<p>In benchmarking performance, there are basically 3 parameters that we can play with:</p>
<ul>
<li>The number of executors.</li>
<li>The amount of executor memory.</li>
<li>The number of partitions that we decompose our distributed matrices into.</li>
</ul>
<p>There are other parameters we could consider, such as the number of cores per executor. But these 3 will serve as a very useful starting point.</p>
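<p>For concreteness, here is a hedged sketch of how those three knobs surface on a spark-submit command line; the driver class, jar, and input path are placeholders, and the values are only illustrative:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># knob 1: --num-executors, knob 2: --executor-memory, knob 3: spark.default.parallelism
# (class, jar, and input path below are placeholders)
spark-submit \
  --master yarn \
  --class com.example.MatrixBench \
  --num-executors 32 \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.default.parallelism=256 \
  matrix-bench.jar hdfs:///data/sample/
</code></pre></div></div>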
<h4 id="number-of-executors">Number of Executors</h4>
<p>We can hone in on the number of executors to use based on a strong scaling analysis. For this analysis, we run our calculation with our sample data using <code class="highlighter-rouge">p</code> executors and record the time. We then repeat the exact same calculation with the same sample data but now use <code class="highlighter-rouge">2p</code> executors and record the time. We can repeat this process, increasing the number of executors by a factor of 2 each time. Ideally, we would hope to find that every time we double the number of executors our calculation completes in half the time, but in practice this will not be the case. The additional complexity of the distributed algorithms and communication cost among partitions will result in a performance penalty, and at some point during this process it will no longer be “worth it” to pay this penalty. This is shown in the figure below:</p>
<p><img src="../images/BlockSS.png" alt="" /></p>
<p>In the above figure we show strong scaling for the multiplication of two square block matrices. The x-axis represents the number of cores used in carrying out the multiplication. The y-axis represents scalability, defined as the amount of time it takes to complete the calculation on a single core divided by the amount of time it takes to complete the same calculation using <code class="highlighter-rouge">p</code> cores. The blue curve represents ideal strong scaling as discussed above. The green curve represents the product <code class="highlighter-rouge">C = A*A</code>, and the red curve represents <code class="highlighter-rouge">D = B*B</code>. Matrix <code class="highlighter-rouge">A</code> fits comfortably in memory on a single core, and Matrix <code class="highlighter-rouge">B</code> is two orders of magnitude larger than <code class="highlighter-rouge">A</code>. What is apparent is that there is little to no benefit of parallelism in calculating the product <code class="highlighter-rouge">C</code>, whereas for the larger product, <code class="highlighter-rouge">D</code>, we observe a nearly linear speedup up to 100 cores.</p>
<p>We can decide if it is “worth it” more concretely by setting a threshold for efficiency, defined as the ratio of the observed speedup versus the ideal speedup. A fair value for the efficiency threshold could be in the range of 50 to 75 percent. Higher than this is perhaps unrealistic in many cases, and lower is probably wasteful, as extra resources are tied up but not sufficiently utilized.</p>
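<p>In practice this sweep is easy to script. A hedged example, reusing the placeholder benchmark jar from above and simply timing each run:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Strong-scaling sweep: same job, same sample data, doubling the executor count each run.
for p in 1 2 4 8 16 32; do
  start=$(date +%s)
  spark-submit --master yarn --class com.example.MatrixBench --num-executors "$p" \
    --executor-memory 8g matrix-bench.jar hdfs:///data/sample/ > /dev/null
  echo "$p executors: $(( $(date +%s) - start )) seconds"
done
# Scalability for p executors is T(1)/T(p); efficiency is that scalability divided by p.
</code></pre></div></div>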
<h4 id="executor-memory">Executor Memory</h4>
<p>We can think about the executor memory in two different ways. One option, in terms of performance, is that we desire to have enough executor memory such that we avoid any operations spilling to disk. A second option, in terms of viability, is to assess whether or not there is enough executor memory available to perform a calculation, regardless of how long it takes.</p>
<p>It is helpful to try to find both of these limits. Begin with some value of executor memory and monitor the Spark UI to detect whether data is spilling to disk; if it is, rerun with double the executor memory until it no longer does. This represents the optimal performance case. For the viability option, rerun while reducing the executor memory by a factor of 2 until out-of-memory exceptions prevent the calculation from completing. This gives you nice bounds on what is possible versus what is optimal.</p>
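<p>The same bisection idea can be scripted, again with the placeholder benchmark job; here the loop walks the executor memory down until the job stops completing:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Halve executor memory until out-of-memory failures appear (the viability lower bound).
# Watch the Spark UI during the larger runs to find where spilling to disk stops.
for mem in 16g 8g 4g 2g 1g; do
  if spark-submit --master yarn --class com.example.MatrixBench \
       --num-executors 16 --executor-memory "$mem" \
       matrix-bench.jar hdfs:///data/sample/ > /dev/null; then
    echo "executor-memory=$mem: completed"
  else
    echo "executor-memory=$mem: failed (likely OOM); lower bound found"
    break
  fi
done
</code></pre></div></div>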
<h4 id="number-of-partitions">Number of Partitions</h4>
<p>A final parameter for tuning that we will consider here is the number of partitions that we decompose our distributed matrices into. As a lower limit, you want the number of partitions to match the default parallelism (<code class="highlighter-rouge">num-executors * executor-cores</code>), else there will be resources that are completely idle. However, in practice, the number of partitions should be the default parallelism multiplied by some integer factor, perhaps somewhere between 2 to 5 times is an optimal range. This is known as overdecomposition, and helps reduce the idle time of executors by overlapping communication and computation.</p>
<p>In fact, we could overdecompose by much larger factors, but we would see that the performance degrades. This is because the benefit of overlapping communication with computation is now less than the overhead of the compute engine managing all of the different partitions. However, if we are aiming for viability instead of performance this may be the only way to proceed. Increasing the number of partitions will reduce the memory footprint of each partition, allowing them to be swapped to and from disk as needed.</p>
<h4 id="performance-tips">Performance Tips</h4>
<p>As you ramp up to larger and larger problem sizes, performance will become more and more critical. I have found that simply adjusting some Spark configuration settings can give a significant boost in performance; the three settings below are pulled together in a spark-submit sketch after the list:</p>
<ul>
<li>Serialization concerns the byte representation of data and plays a critical role in performance of Spark applications. The <code class="highlighter-rouge">KryoSerializer</code> can offer improved performance with a little extra work: <code class="highlighter-rouge">spark.serializer=org.apache.spark.serializer.KryoSerializer</code>. See <a href="http://spark.apache.org/docs/latest/tuning.html#data-serialization">here</a> for more info on serialization.</li>
<li>If you notice long periods of garbage collection, try switching to a concurrent collector: <code class="highlighter-rouge">spark.executor.extraJavaOptions="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"</code>. I have found <a href="http://www.fasterj.com/articles/oraclecollectors1.shtml">this</a> to be a nice reference on garbage collection.</li>
<li>If you notice stages that are spending a lot of time on the final few tasks then speculation may help: <code class="highlighter-rouge">spark.speculation=true</code>. Spark will identify tasks that are taking longer than normal to complete and assign copies of these tasks to different executors, using the result that completes first.</li>
</ul>
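<p>Pulled together, the three settings above can be passed straight to spark-submit (the values shown are the ones discussed, and the jar is again a placeholder):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The three settings from the list above, passed as --conf flags.
spark-submit \
  --master yarn \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" \
  --conf spark.speculation=true \
  matrix-bench.jar hdfs:///data/sample/
</code></pre></div></div>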
<h3 id="wrapping-up">Wrapping Up</h3>
<p>It is fair to think that this process is tedious, and that there is a lot to digest, but the intuition and knowledge developed will serve you as you continue to work with these tools. You should be able to short-circuit this process and enter at the ramp up stage, beginning with a relatively modest sample size. Your experience will guide you in selecting a good estimate for the initial amount of resources and you can continue to tune and ramp up from there.</p>
<hr />
<p>If you found this interesting then you may be interested in some of the projects that we are working on at our Pittsburgh location. We have positions available for data <a href="https://jobs.target.com/job/pittsburgh/lead-data-scientist-pittsburgh/1118/2095496">scientists</a> and <a href="https://jobs.target.com/job/pittsburgh/lead-data-engineer-pittsburgh/1118/2210114">engineers</a> and a brand new office opening in the city’s vibrant <a href="http://www.discovertheburgh.com/strip-district-guide/">Strip District</a> neighborhood.</p>
<hr />
<h4 id="about-the-author">About the Author</h4>
<p>Patrick Pisciuneri is a Lead Data Scientist and part of Target’s Data Sciences team based in Target’s Pittsburgh office.</p>
<p><a href="https://target.github.io/data%20science%20and%20engineering/numspark">How Target Performance Tunes Machine Learning Applications</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on September 29, 2016.</p>
//target.github.io/data%20science%20and%20engineering/dse-intro-one
https://target.github.io/data%20science%20and%20engineering/dse-intro-one2016-03-01T00:00:00-00:002016-03-01T00:00:00-06:00Target Brands, Inchttps://target.github.io<p>On my first encounter with it, around the early 2010s, I was mystified. It sounded like witchcraft, and I imagined the practitioners to be a coven of witches and wizards, all holding Ph.D.s in the dark art of “Data Science” and being respectfully addressed as “Data Scientists”. It was believed they would magically transform haystacks into gold and then ask for your first-born in return as a reward for their service (<em>a la Rumpelstiltskin</em>). There is no denying the fact that the title “Data Scientist” is the most coveted one these days and has a nice ring to it. It’s also true that data science has traditionally been a monopoly of mathematicians and statisticians. Obviously, developing statistical models and machine learning algorithms requires years of training and practice to specialize. In my opinion it is more of an art form driven by science, and it can easily be mistaken for magic.</p>
<p>It’s common knowledge that the more experienced in life we get, the easier it is for us to make up our minds. For instance, “Which diner to pick for a boys’ night out?”, “When to stay off highways to avoid being stuck in a traffic jam?”, “When to buy a house? When NOT to buy?”, are all decisions we make every day. This ability comes as a result of years of learning from implicit experience (a.k.a. unsupervised learning) and explicit instructions from parents, teachers, friends, family and media (a.k.a. supervised learning). Our brain builds models of the world, of the situations we have been in, of the banal and the extraordinary, of nice and not-so-nice, of appropriate and inappropriate, etc. These models facilitate judgment, govern behavior and enable anticipation of likely outcomes. That’s basically data science. The recent progress in large scale and high performance computing has opened doors for such complex calculations to be performed on-demand and much more efficiently than was possible before. Hence, the buzz!</p>
<p>At Target we operate in a guest-centric universe. We don’t treat our guests as a statistic or as just a data point in a trend; our guests are setting their own micro-trends. Our focus is on carefully picking up signals from our guests and learning what actually matters to them. Yes! We are seriously trying to understand each single guest’s needs independently. Therefore, we strive for a fully personalized experience, not just making suggestions based on what is popular out there. As <a href="https://en.wikipedia.org/wiki/John_Fairchild_(editor)">John Fairchild</a> has allegedly said:</p>
<blockquote>
<h2 id="style-is-an-expression-of-individualism-mixed-with-charisma-fashion-is-something-that-comes-after-style">“Style is an expression of individualism mixed with charisma. Fashion is something that comes after style.”</h2>
</blockquote>
<p>At Target, Data Science and Engineering is a group that has gone beyond the conventional boundaries in terms of scale and applicability. To us, the journey to the ultimate goal of a 100% personalized experience began with attention to detail: what would our choice of algorithms be, how would we organize the data, what level of pre-processing would be enough, the frequency of compute cycles, strict SLAs on the APIs exposed, and so on. We were very clear that we needed to provide a consistent experience before worrying about a fully personalized experience; establish reliability before making bunnies appear out of a hat. And our motto has been pretty simple: no matter what channel our guests take to interact with us, we want their experience to remain consistent and pleasant.</p>
<p>What we do is driven by non-trivial problems that mandate an ensemble of approaches. It is a slow and deliberate process of trial and error, experimentation with bleeding-edge algorithms and technologies, and at times dialogue with peers and stakeholders that can run for days. We have played with off-the-shelf algorithms as well as built new techniques. It is such a delight to witness the speed with which new ideas emerge from the insight built through models that use big data as input. We can understand our guests’ behavior and needs better because of our omni-channel retail capabilities.</p>
<blockquote>
<h2 id="-it-is-such-a-delight-to-witness-the-speed-with-which-new-ideas-emerge-from-the-insight-built-through-models-that-use-big-data-as-input">“ It is such a delight to witness the speed with which new ideas emerge from the insight built through models that use big data as input.”</h2>
</blockquote>
<p>Data science and engineering teams that are tasked with solving business problems with quick turnaround times have a burning need to pre-process, ingest, store, process and retrieve large amounts of data fairly quickly, without compromising on security, quality or privacy. In the first year we mostly focused on the design of long-running data pipelines, multi-phase compute workflows, a highly scalable API layer and monitoring capabilities, all aiming for the fastest turnaround time. To name a few techniques, we have used Collaborative Filtering for behavior-based recommendations, TF-IDF for feature selection, and K-Means for clustering. Our exploration has a wider range, though, as we are deeply interested in commoditizing data science for all our product teams within Target.</p>
<p>A key element in our ability to execute has been our choice to go with open source almost every single time we had to make a technology decision, with only a few exceptions here or there. We also rely on the contemporary software engineering and management principles proposed by the Agile development model, which has served us very well. Our engineers have a very wide variety of skills: statistics, machine learning and visualization techniques, delivered through Java, Scala, Python, Spark, R, Hadoop and many more technologies. That’s how we transform abstract ideas into practical and well-engineered products.</p>
<p>We are doing some kick-ass engineering with a focus on building value-driven products through technology. And that’s what data engineers do. For us, no idea is too big to try, and a failure is just a null hypothesis we are trying to reject. Our dream is to go where no one has dared to go before and bring back the riches.</p>
<blockquote>
<h2 id="no-idea-is-too-big-to-try-and-failure-is-just-a-null-hypothesis-we-are-trying-to-reject">“No idea is too big to try, and failure is just a null-hypothesis we are trying to reject.”</h2>
</blockquote>
<p>Sometimes the data is big and sometimes it is small. Sometimes we need statistics, and other times we need just a few occurrences to infer something. Most of the time we can explain it, and sometimes we just conjecture about it. I must admit that working with data can be hard due to the volume, variety and veracity of data. It can test the limits of one’s competence as well as patience. The patterns in the data, or the patterns of data engineering, are never easy to identify, and hence the approach to solving each problem has to be defined as we go. Data Science and Engineering may very well be witchcraft; all it needs is wizards and witches like us to master it. With strong analytical skills, a good hold on foundational mathematics, some engineering background and the stamina to keep up with the exponential learning curve, you can be a data scientist/engineer.</p>
<p>Most of the problems we encounter are pretty unique, and solving data-driven problems is very exciting because you can confirm or reject your ideas pretty quickly, making it a very rewarding experience as an engineer. We receive directions and guidance from our Product Management and Merchandising teams on how business is shaping and what objectives are critical in the coming quarter or year, and we start defining the problems and breaking them down into tasks. This is pretty similar to general software engineering, but the key difference is the nature of the problems we solve, the scale at which the solution must work, the number of end users impacted by our work and, most importantly, measuring and making sense of that impact.</p>
<p>In my first year at Target working as a Principal Data Engineer, there have been no dull moments except the ones when I had to catch up on my sleep. So as I draw this first installment of DSE Update to a close, I want to leave you with a thought.</p>
<p>“All Science is Data Science. Because without data, there can be no science. And all of us are Data Scientists, while some of us pursue it as a career.”</p>
<h4 id="about-the-author">About the Author</h4>
<p>Product Recommendations (Personalization Engine) Lead and Principal Data Engineer at Target, delivering value-driven service at scale.</p>
<p><a href="https://target.github.io/data%20science%20and%20engineering/dse-intro-one">(Data) Science or Witchcraft?</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on March 01, 2016.</p>
//target.github.io/infrastructure/bare-metal-big-data-builds
https://target.github.io/infrastructure/bare-metal-big-data-builds2016-01-13T00:00:00-00:002016-01-13T00:00:00-06:00Target Brands, Inchttps://target.github.io<p>When you first think about scaling an on-premise Hadoop cluster, your mind jumps to the process and the teams involved in building the servers, the time needed to configure them, and then the stability required while getting them into the cluster. Here at Target that process used to be measured in months. The story below outlines our journey of scaling our Hadoop cluster, taking the process from months to hours and adding hundreds of servers in a couple of weeks.</p>
<h2 id="the-need">The Need</h2>
<p>Early 2013 taught us the lesson that manually managing Hadoop clusters, no matter how small, is a time-consuming and very repetitive task. Our next cluster build in 2014 drove the adoption of Chef, Artifactory and Jenkins to help with cluster operations. We stood up those components and created new role cookbooks to manage everything on the OS (configurations, storage, Kerberos, MySQL, etc.). While this was a step in the right direction, it still left us with a manual process to create the initial base server build and then add it to the cluster after configuring it with Chef.</p>
<h2 id="build-foundation">Build Foundation</h2>
<p>Closing the gap in our automation meant finding a way to deliver true end-to-end builds, from an initial bootstrap to running jobs in your cluster. OpenStack’s <a href="http://docs.openstack.org/developer/ironic/">Ironic </a> project was the first piece of the puzzle. Ironic gives us the ability to provision bare metal servers, similar to how OpenStack automated the VM build process. With Ironic as the foundation, we leveraged the <a href="http://docs.openstack.org/developer/nova/">Nova </a> client to manage our instance builds. The nova python client interacts with the Compute service’s API, giving us an easy way to specify our build parameters and spin up an instance on one of our physical servers. The other key piece with the nova client is the ability to send boot information to the server using user data. The bash script sent executes several commands to install the Chef client, set up public keys and run the initial knife bootstrap to set the run list for the build.</p>
<p>Example nova boot command:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nova boot --image $image_name --key-name $key_name --flavor $flavor_name --nic net-id=$network_id --user-data baremetal_bootstrap.sh $instance_name
</code></pre></div></div>
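<p>The user-data script itself is not shown in the post, but a hedged sketch of what a <code class="highlighter-rouge">baremetal_bootstrap.sh</code> along these lines might contain is below; the SSH key, run list, and the assumption that knife is already configured on the node are placeholders, not Target’s actual setup:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Hedged sketch of a user-data bootstrap script; key, role name, and paths are placeholders.

# 1. Install the Chef client via the omnitruck installer
curl -L https://omnitruck.chef.io/install.sh | bash

# 2. Set up public keys so the build automation can reach the node over SSH
mkdir -p /root/.ssh
chmod 700 /root/.ssh
echo "ssh-rsa AAAA... build-automation-key" >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

# 3. Run the initial knife bootstrap to register the node and set its run list
#    (assumes knife.rb and the validation key were staged into the image)
knife bootstrap "$(hostname -f)" \
  --node-name "$(hostname -f)" \
  --ssh-user root \
  --run-list 'role[hdp_datanode]' \
  --yes
</code></pre></div></div>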
<h2 id="server-configuration">Server Configuration</h2>
<p>The <a href="https://docs.chef.io/knife_bootstrap.html">knife </a> bootstrap portion of our user data script sends the instructions for Chef to build a certain node type (Hadoop Data Node, Control Node, Edge Node, etc.). The node types are managed with role wrapper cookbooks, giving us an easy way to manage attributes and run lists without changing our core code. This makes it easy to deploy new features and bug fixes to a specific cluster or role within a specific environment, for example.</p>
<p>Example role attributes:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hadoop
default['hdp']['ambari_server'] = 'ambari_server'
default['hdp']['ambari_cluster'] = 'cluster_name'
# MySQL
default['hdp']['mysql_server'] = 'mysql_host'
# Kerberos
default['hdp']['kerberos_realm'] = 'KDC_REALM'
default['hdp']['kdc_server'] = ['kdc_hosts']
# Software
default['hdp']['ambari']['repo_version'] = '2.0.2'
default['hdp']['repo_version'] = '2.2.4.12'
default['hdp']['util_repo_version'] = '1.1.0.20'
</code></pre></div></div>
<p>In the case of a Data Node role, the build runs through the following:</p>
<ol>
<li>Setting up internal software repo’s</li>
<li>RHS registration</li>
<li>Raid controller installation/JBOD configuration</li>
<li>Base Target OS build</li>
<li>Hostname management</li>
<li>Centrify rezoning</li>
<li>Autofs (Unix home directories)</li>
<li>OS tuning (Hadoop)</li>
<li>Kerberos (client installation and configurations)</li>
<li>Disk formatting, partitioning and mounting</li>
<li>Ambari host registration</li>
<li>Ambari client installation</li>
<li>Ambari service installation</li>
<li>Ecosystem Setup (R, Python, Java, etc.)</li>
<li>Ambari host service startup</li>
</ol>
<h2 id="home-stretch">Home Stretch</h2>
<p>Now that we have a fully configured Hadoop Data Node, we needed a way to add it to our cluster without impacting existing production jobs. This is where <a href="https://ambari.apache.org/">Ambari</a> comes into the picture. Ambari is our primary Hadoop administration tool, providing both a UI and RESTful APIs to manage your cluster. Using the UI was a no-go for us since that would mean manual intervention, so the final push to complete the end-to-end automation was going through the API. Starting with Ambari 2.0, you can integrate Ambari with your Kerberos KDC and have it manage principal and keytab creation. In order for this to work through the API, you need to ensure your KDC admin credentials are being passed. One way to do this is by using a curl session cookie. The JSON payload below was created using a Chef template with an encrypted data bag to securely provide credentials.</p>
<p><code class="highlighter-rouge">kdc_cred.erb</code> Chef template example:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"session_attributes" : {
"kerberos_admin" : {
"principal" : "admin/admin@<%= node['hdp']['kerberos_realm'] %>",
"password" : "<%= @password %>"
}
}
}
</code></pre></div></div>
<p>You can create the cookie with the curl -c option, passing the JSON file that contains your KDC admin credentials. With the session cookie created, you are all set to run through the calls needed to fully install and start a service.</p>
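<p>The original post does not show that call, but one plausible form is sketched below; the payload path is a placeholder and the exact endpoint for establishing the session varies by Ambari version, so treat this as an assumption rather than a recipe:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: create the session cookie jar while passing the KDC admin credentials
# rendered from the kdc_cred.erb template (payload path below is a placeholder).
curl -c /tmp/cookiejar --user admin:<pass> -H 'X-Requested-By: ambari' \
  -X PUT -d @/etc/chef/cache/kdc_cred.json \
  http://ambari.vagrant.tgt:8080/api/v1/clusters/vagrant
</code></pre></div></div>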
<p>1. Service Addition</p>
<p>This step registers the new service on the host specified and leaves it in an “Install Pending” state. Example command for adding a node manager to host worker-1.vagrant.tgt:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -b /tmp/cookiejar --user admin:<pass> -w -i -H 'X-Requested-By: ambari' -X POST http://ambari.vagrant.tgt:8080/api/v1/clusters/vagrant/hosts/worker-1.vagrant.tgt/host_components/NODEMANAGER
</code></pre></div></div>
<p>2. Service Installation</p>
<p>This step continues from the registration and actually installs the service on the host. After the installation is completed the service is in a ‘Stopped’ state but is ready for use. Example command for installing a node manager to host worker-1.vagrant.tgt:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -b /tmp/cookiejar --user admin:<pass> -w -i -H 'X-Requested-By: ambari' -X PUT -d '{"HostRoles": {"state": "INSTALLED"}}' http://ambari.vagrant.tgt:8080/api/v1/clusters/vagrant/hosts/worker-1.vagrant.tgt/host_components/NODEMANAGER
</code></pre></div></div>
<p>3. Maintenance Mode</p>
<p>This step suppresses all alerts for the new service since it is installed but in a stopped state. You could start the service at this point, but the rest of the build process for our internal builds needs to complete, so this will stay down until the very end. Example command for putting the node manager on host worker-1.vagrant.tgt in maintenance mode:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -b /tmp/cookiejar --user admin:<pass> -w -i -H 'X-Requested-By: ambari' -X PUT -d '{"HostRoles": {"maintenance_state": "ON"}}' http://ambari.vagrant.tgt:8080/api/v1/clusters/vagrant/hosts/worker-1.vagrant.tgt/host_components/NODEMANAGER
</code></pre></div></div>
<p>4. Service Startup</p>
<p>With the rest of the build completed, we can remove the service from maintenance mode and start it up so that jobs can begin running on this host. Example command for starting the node manager on host worker-1.vagrant.tgt:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -b /tmp/cookiejar --user admin:<pass> -i -H 'X-Requested-By: ambari' -X PUT -d '{"HostRoles": {"state": "STARTED"}}' http://ambari.vagrant.tgt:8080/api/v1/clusters/vagrant/hosts/worker-1.vagrant.tgt/host_components/NODEMANAGER
</code></pre></div></div>
<p>The above examples walked you through the process of fully adding a Node Manager service using Ambari’s API. The Chef recipes we created internally use that logic as the foundation, but optimize the process to handle multiple client and service installations. Additional information on Ambari’s API can be found on <a href="https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md">Ambari’s GitHub</a> or the <a href="https://cwiki.apache.org/confluence/display/AMBARI/Ambari">Ambari wiki</a>.</p>
<h2 id="the-results">The Results</h2>
<p>With the above process we can complete a single data node build in about two hours. The great thing about the nova client is that builds can be scripted, which let us run builds in parallel and reach our expansion goals by adding hundreds of servers within a couple of weeks!</p>
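<p>As a rough illustration of what that scripting can look like (the flavor, image, and network names below are hypothetical), a simple loop over the nova CLI is enough to kick off many builds in parallel:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: boot ten data nodes in parallel with the nova CLI
for i in $(seq 1 10); do
  nova boot --flavor hdp.datanode --image rhel6-hadoop \
    --key-name bigdata --nic net-id=<network-uuid> \
    "datanode-${i}.vagrant.tgt" &
done
wait   # block until every boot request has been submitted
</code></pre></div></div>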
<h2 id="issues-encountered">Issues Encountered</h2>
<p>Our automation effort ran into issues like any other project. The two biggest challenges we faced were finding a way to register new hosts in DNS and getting Ambari to scale with our builds.</p>
<p>On the DNS side, not having access to create new records on the fly, and not wanting to add change approvals to the process, forced us to look at a bulk DNS load. DNS records were created up front for the new IP space, and during the build each server would look up its hostname based on the IP address it had been assigned. After setting the hostname to the new value and reloading Ohai, we were back in business.</p>
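<p>A minimal sketch of that lookup-and-rename step (the commands and interface name are illustrative only; in practice this ran inside our Chef converge):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: derive the hostname from the bulk-loaded DNS records
ip=$(ip -4 addr show eth0 | awk '/inet /{sub(/\/.*/, "", $2); print $2}')
fqdn=$(dig +short -x "${ip}" | sed 's/\.$//')   # reverse lookup returns the PTR record
hostname "${fqdn}"                              # set the hostname for this boot
echo "${fqdn}" > /etc/hostname                  # persist it across reboots
# the next chef-client run reloads Ohai so node attributes pick up the new name
</code></pre></div></div>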
<p>Due to a known bug, Ambari’s <code class="highlighter-rouge">host_role_command</code> and <code class="highlighter-rouge">execution_command</code> tables were growing at an alarming rate. The more servers we added, the larger the tables became and the longer the installs took. With our systems freeze weeks away, we couldn’t afford to wait and go through an Ambari 2.1.x upgrade where that bug was being addressed. We ended up adjusting the indexes and purging those tables to continue with our builds.</p>
<h2 id="whats-next">What’s Next?</h2>
<p>First, we plan to give back to the community (wiki updates, JIRAs for the features and issues we encountered), with the goal of helping to solve some of those problems ourselves. Beyond that, we want to continue to enhance our initial build MVP by refactoring some of the code and adding validations and build notifications. All of this will continue to mature our build process, getting us ready for more growth in 2016.</p>
<p><a href="https://target.github.io/infrastructure/bare-metal-big-data-builds">Bare Metal Big Data Builds</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on January 13, 2016.</p>
//target.github.io/analytics/big-data-storm
https://target.github.io/analytics/big-data-storm2015-11-11T00:00:00-00:002015-11-11T00:00:00-06:00Target Brands, Inchttps://target.github.io<p>An enterprise as large as Target generates a lot of data and on my Big Data platform team we want to make it as easy as possible for our users to get it into Hadoop in real-time.
I want to discuss how we are starting to approach this problem, what we’ve done so far, and what is still to come.</p>
<h2 id="our-requirements">Our requirements</h2>
<p>We wanted to build a system with flexible open source components.
Our experience with proprietary products on Hadoop is that they tend to be inflexible and only work with a narrow set of use cases.
That can also be true of open source products, but we have found them to be easier to adapt to our needs.
Indeed, that ended up being the case as we went through this journey and made several contributions to Apache projects.</p>
<p>More specifically, we wanted to create a system that:</p>
<ul>
<li>was highly resilient to the failure of any individual component</li>
<li>supported a variety of message formats</li>
<li>delivered data to Hadoop with very low latency</li>
<li>made the data immediately usable</li>
<li>streamed data from Apache Kafka sources</li>
</ul>
<h2 id="a-streaming-framework">A Streaming Framework</h2>
<p>There are many excellent comparisons of streaming frameworks available, and I won’t attempt to recreate them here.
The first criterion we considered was what tooling would be needed to monitor and administer the streaming framework, with a strong preference to use our primary Hadoop administration tool, <a href="https://ambari.apache.org/">Apache Ambari</a>.
<a href="http://storm.apache.org/">Apache Storm</a> fit that bill and was also a proven solution for stream processing.
If Storm could meet our other requirements, it would be our first choice.</p>
<p>To test its resiliency we ran a simple scenario: start a data stream into Hadoop, disable HDFS, and then reenable it.<br />
Streaming would obviously fail while HDFS was disabled, but we needed the system to recover gracefully when HDFS came back online.
Unfortunately our first test of this scenario left our Storm topology in an unrecoverable state, which required a manual restart.
That’s not something we could live with.</p>
<p>We also needed very fine control over the latency of arriving data.
In general Storm processes messages one at a time rather than in batches, which leads to very low latency. However, almost all Storm “bolts” that write data to a destination system will batch their writes for efficiency.
If the batch size is set too large, final latency can be quite high because no data is written until the batch fills up.
We observed that behavior when we set the batch size to be several minutes worth of data in a test scenario.</p>
<p>However, examination of the Storm source code led us to conclude that both of these problems were solvable.
We contributed fixes for both of these issues back to the Storm project as part of <a href="https://issues.apache.org/jira/browse/STORM-969">STORM-969</a>.
With those issues resolved we were comfortable that Storm would meet our initial requirements to get data into HDFS.</p>
<h2 id="making-the-data-immediately-usable">Making the data immediately usable</h2>
<p>Now that raw data was landed in HDFS, we wanted to make it immediately consumable via external Hive tables.
In general that requires knowing the schema of the data when you define the pipeline.
You must also be aware of schema changes, especially incompatible schema changes, made by the upstream systems producing the data you’re consuming.
This proved to be a challenge for several teams because they could not guarantee a source of truth or restrict schema changes.
Additionally, many teams were trying to consume JSON formatted messages.
JSON, like XML, is a popular message format, but not one particularly well-suited to Hadoop because it is not <a href="http://www.inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore">splittable</a>.
However, one team took an approach that solved almost all of these problems at once.</p>
<p>This team adopted <a href="https://avro.apache.org/">Apache Avro</a> as a message encoding scheme.
Avro messages have explicit schemas, and schema changes are well defined. Together with a centralized <a href="https://github.com/confluentinc/schema-registry">schema registry</a>, this creates a source of truth for the schema and can, if enabled, allow only forward and backward compatible schema changes.
Avro is also robustly supported by Hadoop and Hive and allows true <a href="https://cwiki.apache.org/confluence/display/Hive/AvroSerDe">schema on read</a>.</p>
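<p>As a rough sketch (the table name, HDFS path, and schema location below are hypothetical), exposing the landed Avro files is little more than an external table definition that points Hive at the authoritative schema:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical example: external Hive table over raw Avro files,
# with the column layout pulled from a single authoritative .avsc schema
hive -e "
CREATE EXTERNAL TABLE raw_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/raw/events'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/raw_events.avsc');
"
</code></pre></div></div>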
<p>By streaming binary Avro data directly into HDFS and creating a Hive table referencing the enterprise definition of the schema we could make the data instantly usable and robust to schema changes.
Our next obstacle was that Storm did not support writing Avro data to HDFS!</p>
<h2 id="storm-and-avro">Storm and Avro</h2>
<p>Our first pass at adding an Avro output option for Storm required duplicating a tremendous amount of code just to retain the enhancements we had contributed in STORM-969.
Since that much code duplication is really bad, we instead looked at how we could refactor the HDFS facilities in Storm and make them more extensible, which led to <a href="https://issues.apache.org/jira/browse/STORM-1073">STORM-1073</a>.
This enhancement made it much easier to extend Storm to write new data formats to HDFS while keeping all of the improvements we made for resiliency and throughput.
With these changes in place we opened <a href="https://issues.apache.org/jira/browse/STORM-1074">STORM-1074</a> to add support for writing Avro data.
The very minimal amount of code needed to add this feature was a validation of our refactoring approach.</p>
<p>Finally, we ran into a small issue with Hive’s handling of Avro data.
Storm would sometimes leave behind zero-length files where Hive was expecting to see Avro data.
A zero-length file is not valid Avro, and any Hive query on that directory would fail, even if other perfectly good Avro files were present.
That led us to open and contribute a fix for <a href="https://issues.apache.org/jira/browse/HIVE-11977">HIVE-11977</a>.</p>
<h2 id="a-commitment-to-open-source">A Commitment to Open Source</h2>
<p>Submitting all of these changes back to Apache projects definitely added work for us; it would have been much simpler to keep them to ourselves.
But Target is committed to Open Source Software and we understand the virtuous cycle behind our contributions.
By making products like Storm better, others are more likely to adopt them and make contributions of their own that will directly benefit us.</p>
<h2 id="whats-next-spoiler-a-lot">What’s next? (Spoiler: A Lot!!)</h2>
<p>While we can immediately work with the raw Avro data in Hive, it does not lead to fast queries and performance will degrade as our data sets grow.
We need to partition the data and ultimately fit it into a columnar format like <a href="https://parquet.apache.org/">Apache Parquet</a> or <a href="https://orc.apache.org/">Apache ORC</a>.
We’ll also likely have to refresh these transformations as the schema evolves over time.</p>
<p>We also have designs on ingesting the data into <a href="http://lucene.apache.org/solr/">Apache Solr</a> and <a href="https://phoenix.apache.org/">Apache Phoenix</a>.</p>
<p>The lessons we’re learning through this process are going back to the data creation teams as well, so that they can create message schemas that facilitate these transformations while keeping the benefits we talked about above.</p>
<p>Finally we want to store metadata about these processes in a tool like <a href="http://atlas.incubator.apache.org/">Apache Atlas</a> so that users can reason about dataflows on our Hadoop platform.</p>
<p>If you want to help us solve these problems and more, we are hiring!
You can reach me at aaron.dossett@target.com.</p>
<hr />
<p>Aaron Dossett is Senior Lead Data Engineer at Target and a contributor to several Apache Software Foundation projects.</p>
<p><a href="https://target.github.io/analytics/big-data-storm">Real-time Big Data at Target</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on November 11, 2015.</p>
//target.github.io/devops/the-dojo
https://target.github.io/devops/the-dojo2015-08-14T00:00:00-00:002015-08-14T00:00:00-05:00Target Brands, Inchttps://target.github.io<p>At Target we’re always looking for ways to move forward in becoming the best omni-channel retailer that we can be. This journey demands that we enable change in most <em>every part of our technology organization</em>. Our culture, delivery model, technology selections, working arrangements and org structures are all levers we can pull to help us be more responsive. A key question that we continually ask ourselves is <em>“how can we move faster”</em>? Introducing change in a large enterprise can take a long time if we don’t challenge ourselves to be creative around constraints (like not enough “experts” to go around). Traditional learning approaches might not be fast enough, so we’ve started acting on some innovative ideas to break down those barriers.</p>
<p>While this progress has been great, we’ve continually been asking ourselves how we could go faster.</p>
<h3 id="building-capacity">Building Capacity</h3>
<p>One of the newest things we are doing is expanding our capacity for others to learn the new ways of doing things. Since we still have relatively few subject matter experts to go around, we are finding that access to these experts is a constraint that makes it difficult for us to scale the learning process for new teams. To address this constraint, we are standing up a learning “<strong>Dojo</strong>.” The Merriam-Webster dictionary defines a <strong><a href="http://www.merriam-webster.com/dictionary/dojo">Dojo</a></strong> as <em>“a school for training in various arts of self-defense (as judo or karate).”</em> At ChefConf 2015, Adam Jacob delivered a very interesting talk titled <strong><a href="https://www.youtube.com/watch?v=_DEToXsgrPc">Chef Style DevOps Kungfu</a></strong>. This presentation talked about building your practice through repetition and development of skills. We needed a place to develop our own special DevOps Kungfu. Cue the Dojo!</p>
<p>The Dojo is a dedicated space in which our subject matter experts take up residence for an extended time. This is the home base for automation engineers, advanced Scrum leaders, OpenStack engineers, Chef experts, Kafka engineers, etc. Project teams and product teams looking to leverage these experts to build their own expertise can colocate their teams within the Dojo to have <em>easy access</em> to these resources. This learning time could be for a few days to a few weeks to a month or more. Providing access to this knowledge within a confined and dedicated area enables our automation engineers, DevOps practice leaders, and lead developers to move from team to team as needed to support their training and learning needs. We can support up to 8 or 9 teams at a time in the Dojo, which is a significant uplift from our existing reach of learning. Teams that “graduate” from the Dojo are expected to dedicate some resources back to the learning environment in future efforts, thus paying forward the investment to help grow the next group of individuals and teams.</p>
<p>Teams that spend time in the Dojo will be working on the full spectrum of skills, which is <em>way</em> more efficient than attending individual training classes. They will be building the automation of their infrastructure as they design their application using new application technologies. They will be learning Agile development practices, applying test-driven development methodologies as they strive for a continuous delivery model. The value in applying all of these concepts together is very powerful and we’ve seen great progress with the teams as they’ve gone through the Dojo. In most cases, the initial learning curve has been steep but we’ve seen teams hit their stride within a couple weeks and start to see huge productivity gains. These teams have emerged from the experience with much stronger skills and with a new working model that allows them to be much more self-sufficient and productive.</p>
<h3 id="what-about-the-others">What About The Others?</h3>
<p>While the Dojo enables us to level up more teams, we also see the need to place learning resources closer to the teams within their “home” organizations. This helps Target to reach engineers not involved in a Dojo project, as well as provide ongoing support and continuous learning to those who have already come through. To fill these gaps, we are in the process of identifying (or hiring), training, and placing coaches within our technology teams. These coaches have the skills and accountability to accelerate learning among the engineers on these teams. They’re available for questions and guidance as well as providing targeted teaching within their teams. To tie this together, teams that enter and exit the Dojo are engaged throughout with their coach, and Dojo graduates eventually move into some of these coaching roles.</p>
<p>We’re excited to see this progress continue as we level up across our technology organization.</p>
<p><a href="https://target.github.io/devops/the-dojo">The Dojo</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on August 14, 2015.</p>
//target.github.io/architecture/modernization_Intro
https://target.github.io/architecture/modernization_Intro2015-04-15T00:00:00-00:002015-04-15T00:00:00-05:00Target Brands, Inchttps://target.github.io<p>In 2014, I worked on a project to build out a system to provide new business capabilities for our Marketing and Digital spaces. This initiative was not unlike other initiatives we undertake at Target, but it provides a great example of the importance of us investing in an effort to modernize the way we operate, the technology stacks we use and the way we think about systems and systems architecture.</p>
<p>The project took in data in near real-time from multiple systems, processed the data and exposed it to consumers. The closer we came to real-time data the larger the benefit to our guests. There was only one problem – we had to keep one of our current systems running in parallel prior to switching on the new capabilities.</p>
<p>In this scenario, there were a couple of key frustrations I experienced which are symbolic of what is driving our need to modernize:</p>
<ol>
<li>75% of the time spent on the project went to integration with systems which would be shut off shortly after we went live.</li>
<li>The estimated integration work with legacy systems was enough to keep 34 full-time workers busy for an entire year.</li>
<li>The majority of the processing for this system was in a distributed environment, and standing up that environment took over 20 weeks.</li>
<li>The data which would be exposed through this system would be 24 hours old due to limitations of the providing systems.</li>
<li>Every time we went through a release, we would initiate a fully manual testing process which spanned six weeks.</li>
<li>Decision making for this effort spanned over eight teams and four different organizations.</li>
</ol>
<p>From what I have seen, Target is not that different from the majority of large enterprises which have been in business for half a century. We have grown over time, transitioning from one technology paradigm to the next, adding more and more technology debt into our environment as we move quickly to provide business value to our internal business partners and our guests.</p>
<p>I assume we are faced with the same pressures as other large enterprises: do more with less and do it faster. We also need to adapt to a changing retail climate, driven by multiple channels of interaction and increased expectations from our more digitally aware guests.</p>
<p>And while we are trying to do more with less, we also have to ensure we do not disrupt the systems carrying the wealth of technology debt that has accumulated from decisions made over the past 50 years.</p>
<h3 id="key-shifts-we-are-making">Key Shifts We Are Making</h3>
<p>Over the past six months the technical community at Target spent time defining what a Modernized IT environment at Target looks like and what cultural and technology shifts will have to be made to move us forward with IT being a major enabler of our future corporate strategies.</p>
<p>Some of the key shifts which we have started to make include:</p>
<ol>
<li>Moving from an environment of highly centralized, highly shared and largely batch-oriented integration platforms to <strong><em>more distributed, real-time, event-based</em></strong> patterns and implementations.</li>
<li>Moving from a tightly-coupled application architecture to a <strong><em>service-based, loosely-coupled</em></strong> architecture.</li>
<li>Moving our Business Intelligence and Warehousing investments from delivering reports to business users to an environment focused on <strong><em>developing advanced analytics into our core operational systems to take real-time actions</em></strong> driven by software.</li>
<li>Moving from an IT organization which is highly siloed (App Dev vs. Infrastructure) into <strong><em>a delivery model which embraces dev ops, continuous integration/continuous deployment and Infrastructure as code.</em></strong></li>
<li>Driving all team members to a <strong><em>greater engineering focus.</em></strong></li>
<li>Specifically focusing on building a <strong><em>more agile technology organization.</em></strong></li>
<li><strong><em>Streamlined accountability</em></strong> and ownership clearly identified and enabled in <strong><em>“full-stack” ownership teams.</em></strong></li>
</ol>
<h3 id="we-have-started-moving">We have started moving</h3>
<p>As a start we have shaped the future state, identified initial high-priority technology investments which need to be made, secured funding and have mobilized team members to begin creating the Target IT world of tomorrow. Some of the key technology investments we have started and are accelerating include:</p>
<ol>
<li>Kafka</li>
<li>Storm</li>
<li>Spark</li>
<li>Hadoop</li>
<li>Cassandra</li>
</ol>
<p>As you can see by the key shifts outlined above, this is as much a cultural change as it is a technology change.</p>
<p>The time is now to drive Target toward being the omnichannel retailer of the future through heavy reliance on technology. Our Modernization efforts have already become a core part of the bet Target is making in the future. Our guests will be at the center of this transformation and we will be totally transforming our half a century of technology debt through the use of new technologies and a new delivery model, heavily based on open source technologies and a dev ops cultural mindset.</p>
<p>Watch for more posts on this as we progress. We plan to share what we are doing and what we are learning along the way.</p>
<p><a href="https://target.github.io/architecture/modernization_Intro">Modernizing our IT Environment at Target</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on April 15, 2015.</p>
//target.github.io/analytics/robotics-analytics
https://target.github.io/analytics/robotics-analytics2015-04-09T00:00:00-00:002015-04-09T00:00:00-05:00Target Brands, Inchttps://target.github.io<p>One important part of a retailer is its supply chain network. In order to sell a product in our stores, or online at Target.com, we need to have a well-run and well-maintained network to move all of those products from one place to another. It may not sound like a complex problem to ship products from a distribution center to a store, but as soon as you start to have multiple vendors with multiple stores (not to mention online orders that go directly to a guest’s home) it becomes increasingly complex. Retailers get products from companies, referred to as vendors, and then distribute the products to stores through central locations, referred to as distribution centers (DCs). In order to improve throughput of our supply chain network we expand our distribution network by building new distribution centers or by increasing the efficiency of the current network. A great way to improve performance of a DC is to put robots and other automation equipment in it. High-tech equipment like this requires highly skilled labor to maintain and manage the equipment. These highly automated systems run on servers and other control equipment, and they produce a lot of machine data that can lead to valuable insights.</p>
<p>Robotics has always been an interest of mine, starting with my experience in high school with the <strong><a href="https://www.youtube.com/watch?v=i1QyM9WTF18">FIRST Robotics Competition</a></strong> and continuing today as I volunteer with <strong><a href="http://www.usfirst.org">FIRST</a></strong>. When I first joined Target right out of school with a degree in Electrical Engineering, my curiosity and interest in robotics led me to get to know some of the engineers who worked on the robotics systems within Target’s DCs. I mentioned to them that I was working with a tool called Splunk that did a good job of aggregating data from different sources to a central location with a visualization platform built in. The engineers said they had a very cumbersome, inconsistent manual process which involved multiple spreadsheets and home-grown macros to gather and visualize data. The data was spread across multiple locations, and each location had files with the same names as the others. The files required advanced access to servers, and even once someone got access to the data, it was hard to manually move the files to a location where you could interact with them. Once the files were all on your computer and you could open them up in a program like Excel, it was a nightmare to try and parse and make sense of the data. There was so much data and it was so cryptic that it took over four hours (even with a macro that had been developed) each week to make a table that could show useful insights into how the robots were operating for each DC. High-level reports were made for all the DCs that had these machines in them, but to initially make that report took over 300 man-hours, and then an additional 5 hours each week to update the initial report.</p>
<p><strong>These processes were not sustainable</strong> and resulted in reports being generated with incorrect or missing data. Oftentimes reports would simply not get created, leaving Target with no insights from the data. When reports were made correctly, the best way to share them was via email, but then not everyone had the reports, or at least the correct version of them. Even when everyone had the correct reports, the reports were still somewhat cryptic and relied upon end users manually looking up error codes to see what kind of information the report was telling them.</p>
<h2 id="automation">Automation</h2>
<p>Automating and simplifying these reports started off with a simple prototype that just took the raw data and aggregated it so you could see it all in one view at a machine and at a building level. It was a very simple table interface, created by counting the number of errors for each robot and using a lookup table that converted the numeric error codes and sub-codes into useful, easy-to-read English descriptions. Next, we put a heat map over the data so someone could quickly take a glance at the visualization and see if a particular error was happening across multiple machines or buildings, or if a particular robot was having multiple errors. We had it set to update once every hour, although it could have been set to be updated continuously.</p>
<p><img src="//target.github.io/images/TableChart.png" alt="Table Chart" /></p>
<p>We also set an alert so that it would send us an email if a server stopped sending data. Even before we got real insights from the improved dynamic report, we found that our servers were not set up properly, because we kept getting alerts that servers were not generating logs. This was because our servers would get rebooted, but the log forwarding service would not start. After a little while it started to appear that the robotics system was running worse; there were more and more errors than there had ever been before. After looking at the raw logs to verify that our new automated process was not broken, we found out that the old report was actually missing over half the errors because it could not properly handle the way some errors were formatted. Quickly we began to understand how much room there was for improvement: not just in the error patterns we were seeking to understand but, more importantly, in the opportunity to improve the reliability of the infrastructure itself.</p>
<h2 id="operational-efficiency">Operational Efficiency</h2>
<p>We made a couple of simple graphs that just showed some variables over time to demonstrate what else we could do with the data. To encourage broader adoption of the benefits we were seeing, we showed this report to other engineers that manage and analyze similar automation systems. Seeing the potential to visualize, share, and alert on so much data in real time was a strong selling point for them. They all saw the potential of what could be done with the data and started to create graphs, tables, dashboards, alerts, and other ad-hoc reports. Shortly after, the other engineers made a graph from raw data that could fit the movement of the robots to a formula – now, whenever something is tweaked (maintenance schedule, sensors, parts, etc.), they can tell within a matter of days how it is affecting the components, whereas before it would take at least a year before someone could visually notice the difference.</p>
<p><img src="//target.github.io/images/mm_per_sec.png" alt="mm Per Second" /></p>
<p>We gathered our successes and shared the results with others; they were amazed at how quickly we were able to create reports with so much data in them. Everyone began to realize that the feedback loop between robot and engineer had become much shorter, which allowed them to take action more quickly and more frequently. Later that year at an industry conference, the maker of the robots highlighted our advancements in supply chain machine data analysis.</p>
<p>The team has since created many more graphs for both technical users and other business users to show how the DCs are operating.</p>
<ul>
<li>The graphs help to drive productivity by giving accountability to those that are maintaining the equipment (e.g. if something is suffering from chronic failure we can easily check if machines require better maintenance).</li>
</ul>
<p><img src="//target.github.io/images/ErrorsOverTime.png" alt="Errors Over Time" /></p>
<ul>
<li>Target can now perform preventative maintenance based on miles traveled, rather than just a timeframe. We can also see the trend of different metrics and perform maintenance before high usage days. This will save on doing maintenance that was not needed, and it will prevent downtime by doing preventative maintenance before something breaks.</li>
</ul>
<p><img src="//target.github.io/images/MilesTraveled.png" alt="Miles Traveled" /></p>
<p>With just this small scope implemented, significant money will be saved by increasing throughput by 4% per building. The data we now get can also show us that we should be processing different kinds of freight with different machinery. For example, some freight stays in the DCs for much less time, but we were not leveraging our automation systems there as often. This data showed us that we should be using them more; even though this freight is a smaller percentage of the inventory within a DC, that change alone can increase throughput by 2%, resulting in another impactful cost saving.</p>
<p>We have been able to objectively prove with more accuracy than ever that some buildings operate better than others. Since we have implemented these big data tools, the number of errors in a ‘bad building’ has dropped dramatically, as shown by this graph:</p>
<p><img src="//target.github.io/images/bad-vs-good-building.png" alt="Good vs Bad Building" /></p>
<h2 id="next-steps">Next Steps</h2>
<p>Now that we have proven that we can gather quick wins from this data, we are exploring big data solutions that will allow us to create machine learning models to let us know when something is going to fail before it actually does. We are actively creating predictive models by leveraging a wide assortment of tools (LogStash, Kafka, Hadoop, R, Splunk, InfluxDB, statsd, Grafana, etc.)</p>
<h3 id="shoutouts">Shoutouts</h3>
<p>Thank you to Paul Delfava, Adam McNeil, and Trevor Stratmann for jumping on board with this and making it happen!</p>
<p><a href="https://target.github.io/analytics/robotics-analytics">Robotics Analytics @ Target</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on April 09, 2015.</p>
//target.github.io/photos/dotgt
https://target.github.io/photos/dotgt2015-02-26T00:00:00-00:002015-02-26T00:00:00-06:00Target Brands, Inchttps://target.github.io<figure>
<a href="//target.github.io/images/yoda.jpg"><img src="//target.github.io/images/yoda.jpg" /></a>
<figcaption>Setting the stage</figcaption>
</figure>
<figure class="half">
<a href="//target.github.io/images/awesome_guests.jpg"><img src="//target.github.io/images/awesome_guests.jpg" /></a>
<a href="//target.github.io/images/guests_irl.jpg"><img src="//target.github.io/images/guests_irl.jpg" /></a>
<figcaption>Celebrity guests in the house</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/jez.jpg"><img src="//target.github.io/images/jez.jpg" /></a>
<figcaption>Jez Humble opens the day with a bang. The best in the business.</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/ducy_goat.jpg"><img src="//target.github.io/images/ducy_goat.jpg" /></a>
<figcaption>Ducy opens with a goat. To the shock of many, the rest of the talk was goat free.</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/deming_of_course.jpg"><img src="//target.github.io/images/deming_of_course.jpg" /></a>
<figcaption>Deming's obligatory appearance</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/full_house.jpg"><img src="//target.github.io/images/full_house.jpg" /></a>
<figcaption>Record attendance</figcaption>
</figure>
<figure class="half">
<a href="//target.github.io/images/kicking_things_off.jpg"><img src="//target.github.io/images/kicking_things_off.jpg" /></a>
<a href="//target.github.io/images/the_bosses_after.jpg"><img src="//target.github.io/images/the_bosses_after.jpg" /></a>
<figcaption>Heather Mickman and Ross Clanton opening and closing the day!</figcaption>
</figure>
<figure class="half">
<a href="//target.github.io/images/greg_talks.jpg"><img src="//target.github.io/images/greg_talks.jpg" /></a>
<a href="//target.github.io/images/strong_engineering.jpg"><img src="//target.github.io/images/strong_engineering.jpg" /></a>
<figcaption>Greg Larson delivers an excellent talk on building a strong engineering culture.</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/brent.jpg"><img src="//target.github.io/images/brent.jpg" /></a>
<figcaption>Brent Nelson, the man behind Target DevOps Days, talks tech clubbing.</figcaption>
</figure>
<figure>
<a href="//target.github.io/images/matt_talks_monitoring.jpg"><img src="//target.github.io/images/matt_talks_monitoring.jpg" /></a>
<figcaption>Matt Helgen talks monitoring and open source</figcaption>
</figure>
<p><a href="https://target.github.io/photos/dotgt">Target DevOps Days In Pictures</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on February 26, 2015.</p>
//target.github.io/devops/outage-resolution-through-automation
https://target.github.io/devops/outage-resolution-through-automation2014-12-29T00:00:00-00:002014-12-29T00:00:00-06:00Target Brands, Inchttps://target.github.io<p>Recently we launched Project Argus, a 30-day “monitoring challenge” to improve visibility of key performance indicators (KPIs) across our technology stack prior to our peak retail season (in Greek Mythology, Argus is a 100-eyed giant). This effort was structured as a mix between a <a href="http://target.github.io/devops/target-flashbuilds/">FlashBuild</a> and agile. We used two-day sprints, twice-a-day stand-ups, and feature tracking through Kanban boards. Early in the effort, we decided to build our product as monitoring-as-code. Through the use of tools such as GitHub, Chef, Kitchen, Jenkins, and Ruby, we were able to quickly build several monitoring and dashboard solutions for use within our Technology Operations Center. These dashboards and the iterative process we use to continue delivering more content have been embraced by our support teams, who now heavily rely on them to proactively detect and resolve issues.</p>
<p>One deliverable from Argus is a cookbook per core business function, representing the KPIs identified by each business product owner. When run, the cookbooks establish connections to our data sources, build the scripts that process our data, and create crontab entries to automatically run those scripts at our predefined intervals. All of these resources and actions are defined in code.</p>
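<p>For illustration only (the script path and interval here are made up), the kind of crontab entry a cookbook lays down looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical entry managed by a KPI cookbook: run the collection script
# every five minutes and append its output to a log for troubleshooting
*/5 * * * * /opt/argus/bin/collect_order_kpis.rb >> /var/log/argus/order_kpis.log 2>&1
</code></pre></div></div>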
<p>We designed our solution so that the dashboards are decoupled from our centralized processing and metric creation. Additionally, our cookbooks are written to be dashboard agnostic and independent so that anyone can make changes to the appearance and we can switch dashboard solutions easily. The processing of our metrics is centralized and managed on-site with each cookbook receiving its own server. Finally, our dashboard solution detects lapses in data and generates alerts that post into our persistent chat client. This allows us to close the loop on the health of the monitoring system, which can be a painful endeavor!</p>
<h2 id="the-problem">The Problem</h2>
<p>Unfortunately, on 2014-12-06 at around 1:30pm, the servers we use to process the monitoring data went down without warning. The dashboards our support teams had grown accustomed to using were completely blank.</p>
<p><img src="//target.github.io/images/blank_dashboard.png" alt="An Empty Dashboard" title="The Dreaded Blank Dashboard!" /></p>
<p>Within a few minutes, our dashboard solution started alerting of missing data. Our health monitors worked!</p>
<p><img src="//target.github.io/images/hipchat_alert.png" alt="Alert Notifications Posted into Chat Client" /></p>
<p><img src="//target.github.io/images/outage_notification.png" alt="Notification from our Dashboard Solution" /></p>
<p>I greatly enjoy cycling; at the time of the outage, I was at a volunteer event for a non-profit organization called Free Bikes 4 Kids that donates bicycles to families in need. This meant that I was unaware of the outage and the alerts that were being generated. At around 2:00pm I was notified of the outage and I was able to verify that our dashboards were blank through my phone. Luckily, I was just wrapping up my volunteer shift and made the 20-minute trip home to where my laptop was located. When I arrived at home, a quick check revealed that the servers were not responding to pings or SSH requests. What could I do to restore functionality to the Operations Center during our busy Holiday season?</p>
<p>Automate! Rather than continuing to troubleshoot the issue, I realized that using our automation tools would provide the fastest recovery.</p>
<h2 id="automation-saves-the-day">Automation Saves the Day</h2>
<p>Since we are only processing monitoring statistics, I knew that I would be able to process the information from my development machine without any problems. Additionally, I had been working on the code earlier in the week and already had the most recent copy of one cookbook downloaded. With all of this in mind, I used Test Kitchen to build a new Vagrant box and install the cookbooks. In about 10 minutes I had a new virtual machine built with all of our scripts installed and configured. Within a few more minutes the associated dashboards were once again displaying data and the alerts automatically updated to the recovered status.</p>
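<p>For anyone unfamiliar with the workflow, the recovery boiled down to a couple of commands from the cookbook checkout (the directory and instance names below are hypothetical):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: rebuild the monitoring node locally with Test Kitchen
cd ~/src/argus-monitoring-cookbook   # local checkout of the cookbook
kitchen list                         # show the suites/platforms defined in .kitchen.yml
kitchen converge default-centos-65   # create the Vagrant box and run Chef against it
kitchen login default-centos-65      # optional: confirm the scripts and cron entries landed
</code></pre></div></div>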
<p><img src="//target.github.io/images/hipchat_recovered.png" alt="Recovered Notifications Posted into Chat Client" /></p>
<p>By 3:00pm – an hour after being notified (including travel time) – the VM running my local, up-to-date code had recovered half of the dashboards.</p>
<p>The remainder of the dashboards were written by another group and the code in my local repo was significantly outdated. What should I do now?</p>
<p>Fortunately, they also used Git to manage all of the code and their scripts were also deployable as a Chef cookbook. I pulled down the most recent code from GitHub, modified the cookbook’s Test Kitchen file to use a Vagrant box that I already had downloaded, and then ran another kitchen converge. In about 10 minutes I had an additional virtual machine running and the remainder of the dashboards were correctly displaying data.</p>
<p>While the Vagrant boxes were being built, I took a deeper look into resolving the issues with the existing servers. Ultimately, I was unable to connect to either server through our hypervisor console, meaning that our conventional restoration methods were unavailable.</p>
<h2 id="next-steps">Next Steps</h2>
<p>Having the automation already in place for these dashboards made the entire process much easier and greatly reduced the amount of time needed to restore functionality to our Operations Center. Through this automation, we were able to completely build two new ‘servers’ and restore functionality within an hour and a half of the outage’s start. This also provided us with an opportunity to have the original server issue resolved through normal processes without having to increase the priority or contact additional teams so they could remain engaged in guest-facing technologies.</p>
<p>While monitoring-as-code has already proven itself as a worthwhile investment, we foresee several other potential benefits:</p>
<ul>
<li>Anyone can contribute enhancements, upgrades, modifications. These are “living” monitors.</li>
<li>We can easily try out new hosting environments and move our code from one private cloud to another.</li>
<li>We have an additional disaster recovery option since we can essentially deploy our code to any available server.</li>
</ul>
<p>Finally, in the spirit of DevOps and feedback loops, as a result of this experience, we would now like to automate more of the process so that our monitoring stack is self-healing. Ideally, when the alerts are generated indicating that data is missing, our infrastructure will automatically build and provision a new machine to run our scripts from until the original server becomes functional again.</p>
<p><a href="https://target.github.io/devops/outage-resolution-through-automation">Outage Resolution Through Automation</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on December 29, 2014.</p>
//target.github.io/devops/target-flashbuilds
https://target.github.io/devops/target-flashbuilds2014-11-10T00:00:00-00:002014-11-10T00:00:00-06:00Target Brands, Inchttps://target.github.io<p>At a couple of recent conferences around DevOps (<a href="http://devopsdays.org/events/2014-minneapolis/">MSP DevOps Days</a> and <a href="http://devopsenterprise.io">DevOps Enterprise Summit</a>), <a href="https://twitter.com/hmmickman">Heather Mickman</a> and <a href="https://twitter.com/RossClanton">Ross Clanton</a> from Target explained some high level concepts and actions we’re taking to break our mold of working on complex problems. When presented with an opportunity to work differently and challenge our normal delivery of IT assets - from plan to decommission - what would that look like?</p>
<p>First, we asked some questions:</p>
<ul>
<li>What can we do to address the barriers within our organization; can we remove that impact on our ability to get things done?</li>
<li>What can we do to remove time spent moving from meeting to meeting, often recapping the same discussions that you had in your last meeting 2 weeks ago?</li>
<li>What process barriers exist that can be included in the delivery and built as part of the ‘service’?</li>
</ul>
<p>And then we asked … <strong>What if</strong> we could alleviate all of these problems (and more) all at the same time, all while you build team morale and actually have fun (gasp!) building stuff?</p>
<h2 id="what-if">What if?</h2>
<p>Two simple words that can spark amazing conversations, generate innovation and excitement, challenge the status quo, and show how DevOps philosophies can break through - even with the big horses. A ton of this post is simply about getting stuff done (GSD), but it also gets into some specifics around how we approached fixing a business problem in some new ways - at least new for us.</p>
<p>So let’s get into some early details around Target’s journey to challenge the traditional approaches to getting stuff done and take control of the ‘how’ … to redefine our working philosophies and principles, and to minimize the impact radius of our efforts while still going deep with the teams and people engaged. Any endeavor such as this has to have a name …</p>
<p>We affectionately call this process <strong>FlashBuilds</strong>!</p>
<h2 id="what-is-a-flashbuild">What is a FlashBuild?!</h2>
<p>Glad you asked!</p>
<p>FlashBuild: A one-day session that includes 2 agile sprints, all the smart people that you need to build stuff, a facilitator, and a whole lot of positive focused energy. You will solve difficult challenges faster than you thought possible, build amazing new products, improve lines of communication, and build relationships between different areas of your organization. Don’t get me wrong, not every FlashBuild results in a resounding success. Sometimes they fail to produce a finished, viable product. They never fail to produce incredible learnings and a greater appreciation for what your partners in the organization are working on. There is no such thing as a failed FlashBuild if you learn something and have fun!</p>
<p><strong>A little deeper</strong>
FlashBuilds are made up of three main components:</p>
<ul>
<li>Flash mob</li>
<li>Scrum</li>
<li>Hackathon</li>
</ul>
<p>The name originated as part of a project to deliver end-to-end services around API-enabled infrastructure. The goal was to leverage our infrastructure automation assets and implement self-service catalog products that our clients could consume - from cradle to grave. Subscribing to the philosophy of ‘minimum viable product’ (MVP), we iterated through defined releases - the next building off of the last - to create a rapid deployment model of new functionality. The service wasn’t just spinning up a VM and sending login details to the requestor - that didn’t define the service delivery enough. Early on, we identified there was more to the MVP. We needed to enrich the communication between the service catalog and the back-end systems to avoid ‘request black holes’ and provide our clients with near real-time visibility into their requests. We also needed to identify metrics for capacity and service health. And finally, provide actionable life-cycle events into our CMDB, from creation to retirement.</p>
<p>As we expressed these additional challenges, we articulated the requirements to our respective teams and looked at this effort not as a project but in a product model.</p>
<p>The setup - what do we need?</p>
<ul>
<li>A business problem</li>
<li>Team members close to the problem to provide context</li>
<li>Team members to participate in (sometimes) strange, foreign roles</li>
<li>Passionate and engaged leadership committed to finding better ways to get stuff done</li>
</ul>
<h2 id="a-business-problem">A Business Problem</h2>
<p>The business problem will usually be defined in the context of a technology issue or obstacle. This is an easy thing to grasp when we approach the FlashBuild as ‘one part hackathon’, but process is just as much a sinking anchor to getting stuff done as obstacles around technology. We know that technology moves oh so very fast and process is an entry point for decay and entropy - and to lull people into a mindset that ‘the work today is good enough for tomorrow’. The business problem above, self-service end-to-end API-enabled infrastructure delivery, enabled a core group of team members, enthused leadership, and stakeholders to rally and focus on new approaches that incorporate Agile mindsets and incubate this new method of FlashBuilds.</p>
<h2 id="close-in-proximity">Close in Proximity</h2>
<p>One of the aspects to FlashBuilds is proximity, which has at least two meanings - co-location of core team members close to the problem and short access times to experts for fast feedback. We quickly identified the need to have team members in the same room during the effort, but we also needed close access to our system and process experts to bust through constraints that normally hamper progress … you can hear the familiar phrases when those constraints show up: ‘best practices…’, ‘the way we do it…’, and ‘standard process is …’</p>
<p>Our challenge back to that thought process was really pretty simply: ‘What if …?’</p>
<ul>
<li>What if you didn’t have that process?</li>
<li>What if we could try something new?</li>
<li>What if that constraint didn’t exist?</li>
</ul>
<p><strong>What would it look like?</strong></p>
<p>By flushing out the counterfactuals (we <em>should</em>, <em>could</em>, <em>would</em> do <em>/that/</em> a different way), we typically came to see things differently, which allowed us, collectively, to reclassify what used to be <strong>must haves</strong> as <strong>nice to haves</strong>, and our list of deliverables for an MVP would shrink.</p>
<h2 id="playing-roles">Playing Roles</h2>
<p>Many times, team members in the room play roles that are very new to them. This can include being a product owner, tech lead, facilitator (or scrum master), etc. The path we set up for FlashBuilds follows a scrum approach to managing workflow - without getting too far into the weeds about terminology or applying pure scrum to the day’s activities. This is the ‘flashmob’ part of FlashBuilds. The teams are cross-functional and empowered to make decisions around the goal at hand. This empowerment is enabled through empathy, collaboration, and a safe place to test-and-learn (tech, process, and participating in new roles).</p>
<p>The product owner is the one person accountable for helping to clarify the MVP. An essential aspect of this role is being able to articulate ‘DONE’ to the team with acceptable iterations - with limited timing for each FlashBuild, the end-of-day goal had to be usable in helping address the business problem, functional for the product owner, and an obtainable delivery for the FlashBuild team.</p>
<p>The tech lead will fluctuate based on the technical or process focal point of the FlashBuild. This includes a shift of this role between different team members during the course of a FlashBuild. Expertise in the room helps to define this, not titles or pay grades.</p>
<p>The facilitator (or scrum master) keeps the room moving and on schedule. There are also times when this person may be tasked with helping address spikes and blockers the tech team identifies. For example, helping to gather team member resources from teams that aren’t in the room but some detail is needed from that external team to answer a question, verify information, or coordinate scheduling.</p>
<h2 id="leadership-drives-the-culture">Leadership Drives the Culture</h2>
<p>A critical component of enabling FlashBuilds is having leadership support - this includes support of the end goal and trust in the process to complete said goal. This can be counterintuitive to some management philosophies with incentives focused on piece-work (closing tickets, for example) and past experiences that focus on controls around both ‘what’ to do and ‘how’ to get things done. When introducing foreign concepts, establishing clear feedback loops, an open environment for participation, and following through on commitments is vital to a well-orchestrated FlashBuild. Managers that ‘give up’ a team member for a day have to feel there’s some amount of compensation coming back into the respective area. This compensatory return can be in the form of new consumers to a platform, a team member able to share lessons learned back with their home team, or a new service/function of an IT asset.</p>
<h2 id="the-setup-of-a-flashbuild">The Setup of a FlashBuild</h2>
<p>As mentioned above, FlashBuilds are setup as a one-day session … with an end of day goal.</p>
<p>We co-locate into one room. Preferably a room with windows, a big (BIG!) white board, enough room to spread out, and A/V equipment for both in room music and closing FlashBuild demos.</p>
<p>The structure for the day is somewhat typical to other ‘all day sessions’:</p>
<ul>
<li>Gather at 9:00 am for initial planning :: This is the time for the product owner to establish the business problem and answer basic questions</li>
<li>Planning at 9:15 :: User stories are identified and an MVP is established</li>
<li>Working session at 9:30 :: The core team breaks down the user stories and gets to #making_awesome_happen</li>
<li>Stand-up at 10:30 :: Adjustments are made based on any issues or concerns identified</li>
</ul>
<p><em>This is really important as there may be a couple of smaller teams within the core group working multiple fronts (think a tech track and a process track) that have interdependencies</em></p>
<ul>
<li>Working session at 10:45</li>
<li>12:00 lunch break & close of Sprint 1</li>
<li>Stand-up at 1:00 & start of new Sprint :: User stories updated and planning updated</li>
<li>Stand-up at 1:30 :: New sprint and user story merged with backlog</li>
<li>Working session at 1:45</li>
<li>Stand-up at 2:50</li>
<li>Working session at 3:05</li>
<li>Demo at 4:00 PM</li>
<li>Retro immediately following</li>
</ul>
<p>Ahh… the demo! This is the opportunity to close loops, collaborate broadly, and get people invested irrespective of their perceived role or contribution. The demo becomes the spot where ‘the rubber meets the road’ as part of the FlashBuild. The demo is the <em>MVP</em> in a real, tangible form.</p>
<p>There are many similarities to Agile workflows: scrum, sprints, kanban boards, etc. The condensed implementation of Agile immerses people into the process without having to get too overwhelmed with terminology. In fact, many times the terms aren’t really that important or even socialized until team members start to ask questions.</p>
<p>For example, kanban boards. This is a big deal in FlashBuilds. We use a hybrid approach for kanban. We still have three main sections:</p>
<ul>
<li>Backlog</li>
<li>WIP</li>
<li>Done</li>
</ul>
<p>How do things differ? We don’t, out of the gate, normalize a rule set against how work flows through kanban. We keep things free flowing and task oriented at a level our core team is ready to consume. After all, if the team using tools like kanban to visualize workflow aren’t comfortable with the tool, reservation and distraction will come to light with concerns like: ‘Am I using it right?’, ‘What if I do it wrong?’. We start small and build up as new people get more accustomed to the tooling to render workflow, backlog, and ideas of spikes and blockers.</p>
<p>To simplify for folks even more, the WIP portion is more tactile. We use LOTS of paper notes - the kind with a little strip of adhesive - during FlashBuilds. WIP is taken from the Backlog to wherever you are working. You take it with you - as a reminder when distracted or as an announcement to others of what you are working on right now. Once complete, the note is moved to Done. We’ve experienced examples where too much work gets put into WIP, and that can quickly ‘scrum-fall’ you into getting NOTHING to Done. To counter this early on, you take your WIP note with you and work that to Done.</p>
<h2 id="how-about-outcomes">How About Outcomes?</h2>
<p>We’ve used FlashBuilds for multiple business problems - technical and process alike.</p>
<p>For the self-service, end-to-end API friendly provisioning, we completed MVPs around server provisioning, including ITSM governance requirements, in a few full day sessions. This included portal access to instantiate, report, and deprovision workloads all with ITSM governance built into the process.</p>
<p>After a few more sessions, we leveraged the same framework built into the first release, to add self-service Apache and Tomcat options. Soon, we had highly available instances using HA Proxy. The last iteration was a full stand-alone Oracle database on a single VM.</p>
<p>Each of these deliverables completes in less than 20 minutes - in fact, our HA Tomcat would complete in under 8 minutes under normal conditions. A single VM would be complete in about 4 minutes. With ITSM governance built into the full life-cycle, there were no cumbersome change control forms or superfluous steps for a client or consumer to execute to get to their assets. Having the ITSM experts in the room and part of the core team closed feedback loops faster and put resources in the conversation to clarify ‘must haves’ vs. ‘nice to haves’.</p>
<p>To deliver on the service side, we leveraged our CMDB for all the IT assets, but we went further than just providing metadata about individual servers. Since the service comprised multiple platforms and technologies, we had to provide instrumentation and metrics about the overall capacity and health of the environment and present them through the same portal, keeping the user experience as consistent as possible.</p>
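<p>As a rough illustration of that roll-up (the data sources and field names here are invented for the example, not our CMDB schema), the portal essentially needs a single summary aggregated across every platform in the service:</p>
<pre><code class="language-python"># Illustrative sketch only: aggregate capacity and health across the platforms
# that make up the service so the portal can show one consistent view.
# Field names and sample data are assumptions, not our real CMDB schema.
from statistics import mean

platform_metrics = {
    "hypervisors": [{"cpu_used_pct": 62, "healthy": True}, {"cpu_used_pct": 71, "healthy": True}],
    "haproxy":     [{"cpu_used_pct": 18, "healthy": True}],
    "oracle":      [{"cpu_used_pct": 45, "healthy": False}],
}

def service_summary(metrics):
    """Roll per-platform samples up into one capacity/health summary."""
    all_samples = [s for samples in metrics.values() for s in samples]
    return {
        "avg_cpu_used_pct": round(mean(s["cpu_used_pct"] for s in all_samples), 1),
        "healthy": all(s["healthy"] for s in all_samples),
        "platforms": {name: all(s["healthy"] for s in samples) for name, samples in metrics.items()},
    }

print(service_summary(platform_metrics))
</code></pre>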
<h2 id="what-is-a-flashbuild-not">What is a FlashBuild <em>Not</em>?</h2>
<p>We’ve tried to be very careful about where we employ the FlashBuild process. It is not a SWAT-type tool … it could be, but there is arguably more rigor and discipline in Agile (e.g. sprints, demos), and that is an aspect many team members find surprising about FlashBuilds. Also, if documentation of a process or technology were the end-of-day goal, the FlashBuild process probably wouldn’t be the best way to clear that documentation backlog. However, based on the outcomes of a SWAT or documentation working session, there may be specific business problems that a FlashBuild approach could help address.</p>
<h2 id="thanks-to-target-team-members">Thanks to Target Team Members</h2>
<p>Launching this type of effort takes a multitude of people playing significant roles, and the FlashBuild process evolves and improves with every execution - I can’t remember everyone involved in every FlashBuild. A sign of the strength of the process is the fortitude and endurance of participants who treat FlashBuilds not as an ‘in the moment experience’ but as lessons to take back into their working day for tackling complex tech and process issues. It’s not by accident or luck, but by effort and hard work, that FlashBuilds improve.</p>
<p>By understanding broader systems-thinking approaches, different collaboration methods, and experiential learning, those who facilitate FlashBuilds find ways to improve (#KAIZEN) the FlashBuild product. Participants walk away DevOpsing … and sometimes they don’t even know it. The most common outcome is minds opened to different working models and comments like, ‘All of our work should be like this!’ and ‘When’s the next one to work on problem &lt;______&gt;?!’</p>
<p>But if not for the Target team members who support, facilitate, and participate - with open minds and a desire to improve - FlashBuilds wouldn’t be as successful as they are.</p>
<p>Special thanks to <a href="https://twitter.com/bauer051">Steven Bauer</a> for helping me write and proofread this post.</p>
<p><a href="https://target.github.io/devops/target-flashbuilds">Target FlashBuilds</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on November 10, 2014.</p>
//target.github.io/standards/target-joins-w3c
https://target.github.io/standards/target-joins-w3c2014-10-15T00:00:00-00:002014-10-15T00:00:00-05:00Target Brands, Inchttps://target.github.io<p>On Tuesday, October 14th, 2014, Target affirmed its commitment to technology and leadership in the retail space by joining the World Wide Web Consortium (W3C). As a company that cares about delivering great experiences for customers, we realize those experiences are now steeped in technology. One way Target can have a significant impact on shaping technology is through open engagement with other companies and organizations at the heart of emerging development - the W3C enables us to do just that.</p>
<h2 id="so-you-joined---now-what">So you joined - now what?</h2>
<p>There are so many areas of technology important to retail and associated businesses. Initially, we’re interested in engaging with tech leaders, businesses, and organizations on web services, location-based content distribution (iBeacons, smart stores, etc.), and the Internet of Things. Through open dialogue and collaborative work in the W3C, we seek to push these areas of technology forward to enable new and better experiences for our customers.</p>
<h2 id="more-to-come-soon">More to come soon!</h2>
<p>This is just the beginning. As an organization we will be engaging with the developer community in new and different ways, proposing ideas on the technologies that matter to us, and providing feedback of our progress. Now let’s get to work!</p>
<p><a href="https://target.github.io/standards/target-joins-w3c">Target Joins the W3C</a> was originally published by Target Brands, Inc at <a href="https://target.github.io">target tech</a> on October 15, 2014.</p>