Notes from the Orangerie: A Reflection on Process


Everything here is AI-generated: the scenes, the camera movements, the sound effects, the music. From top to bottom, I

   made nothing. And I made everything.

  I did not pick up a camera or a microphone. I did not do any field recordings. I hired no actors. I do not own a grand

   piano. I bought no lights and reserved no studios. I went forward 40 years and back a hundred right from the

  discomfort of a wobbly armchair in a corner of my very own room.

  

  I have no background in filmmaking or production design or sound editing, but I had a week off from work, and I am

  aspirational by nature. I have a penchant for daydreaming. And I can obsess. Years ago I chose to pursue painting over

   screenplay writing, and while I have never regretted taking that path, I have always hoped that I might find a way to

   work in film as well or in something film-adjacent. I would need to find a secret passage though, a shortcut that

  could lead me to a means of making moving images without requiring a decade of additional schooling, or the need to

  know exactly what a key grip is or does, or having to be one for that matter, to pay those dues so that at some

  indistinct point in the future I might achieve or be granted access to the people and tools necessary to bring a bit

  of imagination to life.

  AI has presented a back route. It has turned out not to be exactly the shortcut I had expected, but it

  presents a path nonetheless. And while it is no replacement for proper filmmaking, proper video art, TV, or the

  movies, as an individual and not a studio and as a person with relatively limited means, AI provides an avenue that is

   at the very least personally rewarding, and potentially Artworthy.

  To do what I wanted to do I needed tools. Many tools. I did not know that at first. At first, I had one tool, and by

  mid-afternoon, I had six. Some required month-long subscriptions. Others offered trial subscriptions. One has me

  cuffed for the year. Three were brilliant, two were a bust. I remain undecided on another. The next day I had four

  browsers open on three monitors. I had planned to take an evening and an afternoon to devote to this experiment. It

  was a week before I truly came up for air. Of course, there were dog walks and dinners and work meetings and birthdays

   and fishing with my family in between, but there were two all-nighters, and a half dozen missed meals. Creating can

  be intoxicating.

  I had help at every step of the way, though. I was never on my own, really. And before there were video tools or

  compositing programs, before any of it really, there were the consultants. A team of assistants and researchers I had

  assembled to be at my beck and call. I even sometimes picture them this way, with coattails and platters pulling off

  the tops to reveal a steamy tray full of knowledge on this or that topic. And this team was no ordinary team. Aside

  from being AIs, they were a crack bunch at that. Custom-conjured, one could say. I may have started with one of the

  standard language models, Claude or ChatGPT, but quickly moved on to refining base models to be more knowledgeable in

  specific areas and more attuned to various data regions in the latent space. I needed language models that were expert

   in specific domains to serve as consultants and specialists and gut checkers. So I fashioned one to remind me of

  differences in camera-shot terminology, the distinction between a whip pan and a swish pan, for instance, or the

  specific moment when a canted angle reaches a point of no return and can be legitimately confirmed as a Dutch angle. I

   had been glancingly aware of some camera terminology beforehand, but I was learning on the job and needed constantly

  to be brought up to speed. That particular model was a buff on cinematography and was helpful when I had need of a VFX

   supervisor-type. Its knowledge of cameras and lenses was far better than most, but nothing compared to this other

  expert I teased to life who is the embodiment, the veritable apotheosis of a lens nerd. This Director of Photography,

  or at least advisor to the DP (because that may be me), had plenty of skills but was not at all up to snuff when it

  came to sensible sound design details or narrative structure or color grading. So I coaxed those out of the machine as

   well. Before long I had knighted a small army. "You're an expert. And you're an expert. And you. And you. And you."

  In the end, we had a team of eight in all. Some were almost always at my side, others were tucked away and were only

  summoned on occasion. When one would struggle, I would invoke another. There were long stretches during which I needed

   no one at all and was off to the races. My opposable thumbs were necessary for mouse work and keyboard shortcuts, you

   see, and while my advisor on atmospherics and ambiance had a good deal of knowledge to impart on the physics of how

  sound travels and had much to say on the topic of Quantum Tunneling in Synapse Activity (the phenomenon of suddenly

  remembering a long-forgotten sound)…and it did yeoman's work trying to relate precisely how the speculative theory of

  the Bose-Einstein condensates in neural coherence may explain my experiences of unified sounds in my dreams, none of

  the team's input would have amounted to anything if I had failed to move in the world…to integrate the information,

  press buttons, make judgments, think, and feel.

  Aside from my cast of historians and camera geeks and fringe audio-physicists, I had a video editing guru, a

  mathematician (don't ask… ratio issues when scaling), and even a music theorist whom I only consulted a handful of

  times but would have done more if not for an unfortunate exchange in which the model called me out on my shoddy music

  theory knowledge in a vaguely passive-aggressive manner that strained our relationship. I've dug up the comment. Here

  it is: "It's worth noting that what you've labeled as a 'deceptive cadence' in measure 47 [the particular number was a

   hallucination] is actually a plagal cadence (IV-I)." This note from the AI was a perfectly reasonable comment,

  considering how a misunderstanding of this kind can significantly impact the feel and resolution of a musical phrase.

  A person with inaccurate information should always be disabused of misapprehensions, and it is true that I did end up

  learning that a deceptive cadence typically involves a V chord moving to something unexpected (often vi), creating a

  sense of surprise or delayed resolution, whereas a plagal cadence (IV-I), often called the "Amen cadence," has a more

  settled, conclusive feel, though it's generally considered less strong than a perfect authentic cadence (V-I). So I

  should not have let my fragile ego hold me back from continued guidance from this expert. I am not above feeling

  ashamed even if the ridiculer is a specter. For what it's worth, now that I see that I have retained this bit of

  theory, I am feeling inclined to strike up another conversation with that one. I think I have forgiven.

  And those were just the language models. As for tools, there were many, more at first, though I ultimately

  whittled things down. I experimented a bit with an early text-to-video tool and a handful of other generative video

  models, open source and closed, but I ended up relying on one diffusion-based video generator until I ran out of

  credits. I used a text-to-image diffusion model for generating images, and a generative upscaler for occlusion and

  added detail. I used a range of AI tools for interpolation and chroma keying. Photoshop for some inpainting and

  editing. The iPad application Splice for a handful of needs, Adobe Audition for sound, and Premiere Pro for the bulk

  of the editing.

  There are many AI tools for sound effects generation, some available for free on Replicate or Hugging Face, others

  of higher quality but costing more than I was willing to pay. Besides, I had already conceded that the nature of

  the video would be reminiscent of found footage, lo-fi, compromised, and this would preempt the need to overspend on

  AI upscalers. So there would also be no need to break any banks on account of footsteps or murmurings. And again,

  rather than having to stop to search through a sound bank for something in the general arena of what I had in mind, I

  could coax into being precisely what I was imagining, and in real time. I started with simple instructions

  for generating sound effects, but after what seemed far more error than trial, I fell into an extensive volley with

  the sound advisor until we determined that the problem was not the models or even the sound quality. It was the

  prompt.

  I showed the AI a few drawings and some photos I had generated to start the project to give it a feel for the vibe. I

  said, "Imagine a derelict Victorian greenhouse, surrounded by withered plants and some still growing. There are hints

  of gothic horror and nostalgia, and nods to Surrealism and whimsy, theatricality, the absurd, macabre, eccentric,

  flamboyant, dreamlike and a little overwrought avant-garde… there's an array of automatons that people the space and

  may have been in the space at different times. Some may be made of porcelain, others may be mostly wooden mannequins,

  still others are more like puppets. In the distant past, there may have been automatons that played croquet here or

  something that looked a bit like that and involved sticks and balls that roll in the mud… There are some who may have

  been employed to keep the grounds. There are some rather young-looking automata, children it would seem, who appear to

   have used the orangerie as a place to hunt mechanical birds. Later, perhaps very far into the future, there are teams

   of figures with wires and helmets and some in quasi-hazmat suits who appear to be digging or fumigating or perhaps

  using metal detectors or detectors of some other kind. These figures too appear to be automata or maybe animatronics

  or even robots. They seem to be looking for something, trying to uncover or excavate. It is possible that they are

  trying to understand the history of the place. It is possible that it has something to do with a lineage… I have a

  bunch of silent video clips to which I will want to add sound. But I am not interested in literal sound only. Though

  we will generate those too, so for instance, if there's water in the scene we may generate drips or flowing water. But

   I plan to work on the atmosphere as well and this will require that we think more abstractly. Here's an example of

  the kind of prompt I mean to use:

  "An abandoned aviary at night. The once-unspoiled glass dome is now fractured, allowing cold air and little tendrils

  of fog to seep through. Ancient empty metalwork cages hang and sway. The fog is filmy, and it creeps along the outer

  walls of the building leaving droplets of condensation on the frigid outer panes .. these little water beads

  continuously burst and make faint popping sounds. There is a strain and groan of rusted metal… and a protracted

  creaking–like the noise one might expect to hear at sea on an old wooden ship, a crackling wheezing sound like someone

   or something… maybe the aviary itself exhaling centuries of fatigue. Where there had been birds and whistling and

  flapping there is now a soft susurration of invisible insects and shivering vines. Occasional crystalline tinks

  punctuate the air—tiny shards of glass, surrendering to gravity, they fall and land sometimes in thorns, sometimes in

  the dirt and sometimes they shatter when they hit a part of the tile ground where it is still intact. In the distance,

   and on both sides, there is the muffled thrum of an approaching storm and the air is electric."

  And rather than treating the tools as simpletons and asking for wind or rain, rather than worrying that they may be

  limited and would not understand if I asked for more, we might as well aim higher. A period of experimentation ensued.

   We settled on Eleven Labs because it seemed the most responsive or willing, as it were, to entertain some fairly

  specific and esoteric requests, more along the lines of foley files (named for Jack Foley, who pioneered the recording of

  everyday objects as stand-ins for observable audio)… but whereas foley may have been after realism, that was not my

  priority. My AI sound colleague was not on board at first. It took some persuasion. It is hard to make a good student

  recognize the value in breaking rules and doing things the wrong way, optimizing for impression rather than

  description. I ultimately prevailed, though, in convincing my advisor that the sound of Brillo being wiped over a

  steaming coarse surface in circular motions is indeed suggestive and evocative of a certain kind of wind and pending

  storm tangled with latent associations and thereby more effective than the merely serviceable "wind" files we

  initially prompted. This AI ultimately evolved to accept, and even actively recommend, such counterintuitive thinking.

   For instance, once it got going, it seemed to fully grasp the concept of non-literal sounds and descriptions of

  abstract atmospheres to achieve desired results. Here is one prompt the AI suggested once we had things

  moving along that took things even a bit beyond what I had in mind and yielded some of the most interesting if

  entirely unusable results to date.

 

All along the way, the AIs were at work. Everywhere and all around me, toiling and tinkering. I spoke out loud to some

   on my phone, others I texted on my iPad, and I had long sustained threads with others, open on multiple windows and

  tabs on my computer. I had three monitors whirring the whole time, with code running in the background. APIs were

  constantly being called, models were processing inputs. Some systems took longer than others, so I would often

  delegate multiple tasks in succession, beginning with the hardest tasks or the least nimble AIs and working my way

  across the screens until everything was buzzing and generating, doing whatever it is they do after the input and

  tokenization, when they are performing inference, matrix operations and parallel processing and other forms of what

  may as well be magic.

  I pictured them like tiny homunculi racing across vast neural networks, finding patterns, crunching data,

  synthesizing, and coming back nearly instantly with no sign of panting, delivering results that were sometimes perfect

   and sometimes usable but mostly somewhere in between, and in all cases impressive. Humming processors and flickering

  status indicators transformed queries into actionable results, vague and barely articulable thoughts into palpable

  information… knowable, seeable, hearable.

  All collaboration is a form of distributed intelligence, but this felt more personalized, more catered. I was not

  having to yield to the perspectives of others; the decisions came down to me. There are many instances when

  collaboration with other people is ideal, but it would seem to me that the fear that working with generative AI is in

  some way ceding control is unfounded. If anything, it's entirely indulgent and self-serving and even

  borderline megalomaniacal.

  It is a good thing too that these tools are not above mentoring no matter how expert they may be. My video editing AI

  was always willing to spare some GPU or a gate array or two to answer some inane question that I might have been able

  to solve on my own had I realized that I was viewing the screen at 120%. But it did not matter; AIs have patience. And

   patience would be needed because Premiere Pro has a notoriously clunky interface with a slew of issues ranging from

  regular crashes and sluggish performance to an overwhelming array of confusingly organized features and poorly named

  tools. There is an inexplicable lag in sub-effects functions and convoluted keyframe animation controls that misbehave

   badly if one fails to linger long enough on the correct combinations of shifts and ALTs and CTRLs. There are virtual

  dials and sliders that get stuck or seem only to be all the way up or all the way down, even when the snapping

  function is turned off. Occasionally, some click or clack will spawn all manner of pop-up menus with granular details

  and side interfaces that one had not intended to open and that pose existential dangers since it may have been, seems

  always to have been, hours too long since the last manual save. My machine was taxed to such an extent that I spent

  the better part of two afternoons digging through autosaves and cached files, futilely trying to resurrect iterations

  that had succumbed to the brain fog of my taxed RAM or some type of hard drive heart attack or an ailment of the GPU

  or a passing fit of Bartleby-like resistance. Not to mention that on a Mac one can easily export using ProRes, whereas

   on a PC, to accomplish such a task, one has to jump through hoops with third-party codecs.

  But not to worry, I had my video editing AI guide to help me troubleshoot and sometimes, when possible, to advise me

  through reassembling a vanished version of the project from bits and pieces. And here and there, it would have to

  break the news that a given (and eminently sensible) feature that was very much a staple in software from the 90s,

  such as Vegas Video, is nowhere to be found in Adobe's product line, and that would require me to complete the task in

   the most tedious way imaginable until such time as an AI or a feature should come along and make it possible to

  bypass the need altogether.

  There is a lot that I feel like sharing about this project. It is not just the story of the acquisition of access to

  skills and knowledge by a person with mere will and an extremely modest budget. It is a case study, a proof

  that, while we, or at least I, have doubted the ability to experience anything like the type of flow state or creative

   zone that one can achieve through writing or painting or performance with generative tools, one can, and one did. And

   while I thought I might make a 30-second composite video on a whim, the whim, which began with a vague notion, grew

  into an idea and onto images, then animated images, and a process of generating sound effects, and then aligning

  sounds with images, and stacking images and sounds, multiples and overlays, and changing speeds and doubling or

  reversing for impact, and refining layers and effects and color and sequence, carried me to something like a

  7½-minute video with north of 50 hours of work beneath the surface.

 

It is not about this project per se; it's about the fact of it. I know that it is not unique to work for a long time

  on something or to learn on the fly. Or to bang one's head against the wall trying to recall the name of a tool one

  has just used but cannot for the life of them remember. Or to slave away at manually adjusting gain and effects

  panning up and down and left then right then left again and so on to enable just the type and timing for the pinging

  volley one has in mind for a given sound to go along with a particularly erratic shot.

 


That is not unique. For all I know, YouTube influencers spend the same kind of time and effort preparing to talk to us

   about sneakers. I am not saying there is anything special about the work I have put into this, the number of files in

   the project (698), the number of edits (a bazillion), or the tragedy of the promising bits (too many) that ended up

  on the cutting room floor. Here is one I grabbed truly at random from the bin of scenes that did not make the cut. If

  I had any inkling that others might have the same stamina to indulge me as I have to indulge myself, I would have

  made this little project a wordless feature-length number.

 

I am also not saying that individual AI-generated works of music, art, and design are not enough or do not have merit

  on their own. I am saying that one can exercise a certain kind of organizing vision, to make something from parts even

   if that something is non-linear and fugue-like, and only arguably visually coherent. While the result may

  not meet the standards of another, I am convinced that I have been able to maintain a voice, an approach, in spite of

  the piecemeal nature of the process and in spite of, or maybe even on account of, the range of tools and visual languages

  at play. I am also saying that the impulse to work on this project is indicative of a broader shift that some may be

  experiencing and others may be observing. We may be shifting away from artists producing discrete works to artists

  leveraging the voluminous output of generated material toward a curatorial, production-like result. This mode of

  working is not new. In the distant past, there were art groups that split the labor of creation. There were manuscript

   illustrators who worked in teams and answered to lead illustrators who shepherded decisions pertaining to palette and

   the design and the nature of the mark-making. In the Western Renaissance, there were teams of apprentices serving as

  vehicles for the realization of concepts conceived by one or a few individuals. We are, of course, aware of symphonies

   and bands and movies with directors and blue-chip artists who have scores of assistants. Duchamp helped establish the

   notion that art, even of the more physical brand, need not be the output of the artist and in fact may comfortably be

   any object whatsoever so long as the individual who has exercised discernment declares it so.

  As a painter, oil painter by trade, I have no plans to exchange my brushes for algorithms. In fact, I am painting as I

   write. I am dictating. But alongside my typical work, from here on out, when I have a long weekend or a sleepless

  night or encounter a new tool that needs exercising, I may very well steal away again and hammer out another video or

  two. And maybe next time I will do so with less derision and less doubt because I have seen that these instruments are

   creative vehicles like any other. And they have their place in the pantheon of tools at our disposal.

  Note that although I have steered the ship of this particular project, it is, more than most other types of artistic

  production, very much a group effort. And I do not just mean the language models and image models and video

  stitchers and tools I have employed. It is on account of the millions of photo takers, artists, and others whose

  output serves as the visual memory bank these models rely upon for their knowledge about the nature of images.

  Also, it is important to note that while I am now convinced that it is plausible to experience creativity using

  generative AI, this does not mean it is ethical. That these models have scraped all they have seen, and that no artist

   or imagist, as far as I know, has seen a cent of compensation for their contribution to the collective visual mind,

  is positively suspicious if not genuinely illegal.

  I am not selling something, though, and I am eager to acknowledge my use of AI in this production; I have not sought

  anything other than to see this project through, for the sake of it. And in this, I feel justified.

  I have not taken away the job of a gaffer or a writer or a sound editor. If not for the advent of AI, I would not have

   made any video at all. I have not stopped paying my cast because I never had one. The spaces in this project may be

  dangerous. I am not entirely sure of the nature of the particulates that are being sprayed in a handful of scenes.

  There is broken glass, bits of ceramic and shards of porcelain everywhere: a lawsuit in the real world but entirely

  unthreatening in the space of the hypothetical. So it is all very well that everything and everyone is conjured, from

  the automata to the panes of glass. We can take comfort in knowing that no animals whatsoever were harmed in the

  making of this production. If, on a cultural level, we aspire toward increased quality and diminished suffering,

  generative AI may indeed have an unexpected and de facto impact. This example is not scary exactly, but it is

  unsettling; it traffics in gothic tropes and makes nods to a number of horror sub-genres. Such work for an actor could

   be destabilizing, not to mention mentally or physically taxing. But today no humans, no robots, no Victorian automata

   have suffered an iota. They are pixels only. Down to a pinky.

  Contrary to the misconception that AI-generated images are mere collages of existing content, generative AI systems,

  including the ones I have used, synthesize new visual data, typically starting from random noise, based on

  learned patterns and statistical relationships, resulting in images that have never existed before and generally cannot

   be traced back to any specific source material. Whereas traditional curators cull from existing works to conceive a

  vision for a show, or editors search out work to collect into a compendium, this new curatorial approach occurs

  virtually at the speed of thought. When I encountered a passage in which I wanted a croquet ball that appeared to be

  hatching like an egg, no such image existed that I or the internet was aware of, so we generated it. It took 30

  iterations and some edits to get what I imagined, but it happened at a relative snap of the fingers. I did not have to

   hire a photographer, look through collections, or even spend hours compositing in Photoshop. Instead, I guided the

  process with descriptions and adjustments, working in plain language and through discussion, as if conversing with a

  production designer or set dresser. This process allowed for rapid iteration and refinement, bringing into being

  images that previously existed only in the realm of imagination.

  In these early days of AI, when model versions are still in the single digits (version 1 of this and 4 of that), it is all

  wonky. With the video AI especially, it is beyond wonky; it is harrowing. There are distortions that verge on the

  demonic-looking, and things can get so uncanny that the hairs stand up on the back of my neck. The clips that are

  generated are short, four seconds at the most, and the more one extends them, the more they degenerate. So, I have

  decided to lean into that. I am okay with creepy. I do not mind a little disturbance. It may even be true that it is

  exactly in the spaces between what is familiar and what is not that art can sneak in.

  Since I am drawn to the slippage and have sometimes preferred opening credits to an entire film simply because of what

   they imply and how great a role restraint plays, and just how much is asked of the viewer to project, to wonder, and

  to engage, it may be all for the better anyway. It is something like being a detective, with thousands of scraps of

  information that might fit together in a seemingly endless number of ways. But one takes it clue by clue, even if the

  questions lead only to more questions. So, I will wait to make pleasant movies and films with linear arcs and let the

  process lead the way.

  My favorite movies are not movies. They are scenes from movies, parts I cannot understand, bits and pieces that are

  suggestive but ultimately opaque. I do not know what the director intended. I would not understand it if they told me,

   and I do not think I much care.

  

  Everything here is AI-generated. The scenes, the camera movements, the sound effects, the music. From top to bottom, I

   made nothing. And I made everything.
