BioHackers Podcast

BioHackers Podcast Ep. 3 - Biodata, AI, & Workflows featuring Ben Sherman

June 28, 2022 · David James Clarke IV and Alex Feltus, featuring Ben Sherman · Season 1, Episode 3

Welcome to Episode 3 of the BioHackers Podcast!

In this episode, David and Alex welcome Ben Sherman to the show. Together, they discuss the role of AI in finding patterns in data, the magic of workflows, image analytics, the top 3 skills of a computational scientist, and how to become a Groovy coder.  

Watch the Video Podcast on YouTube: https://youtu.be/Q5kSRbg2w1M

Here is a list of topics:  

  • Welcome to Episode 3 (00:00)
  • AI Eyes Find Patterns in Data (02:45)
  • Feltus + GTA + AI = Tumor Genomics (06:00)
  • Welcome Ben to the Show (07:30)
  • How Did You Get into AI and Bioscience? (11:32)
  • Magic of Workflows | Ben’s Dream Job (15:30)
  • Workflow Managers = Next Generation of Code (18:28)
  • How Hard is it to Use Workflow Managers? (21:14)
  • “Groovy” Nextflow Programming Language (25:49)
  • Nextflow Tower is an Onramp to the Cloud (29:38)
  • BioScience Research is Troubleshooting (31:50)
  • AI Applications for BioSciences (36:04)
  • Image Analytics is Common Denominator for AI Applications (46:04)
  • Top 3 Skills to become a Computational Scientist (48:09)
  • What is a BioHacker to You? (54:30)  

Enjoy the Show! 

hello and welcome to another episode of the biohacker podcast I'm your co-host David James Clark IV here with the esteemed Alex feltus PhD how you doing today Alex excellent how you doing David doing awesome thank you I'm really excited about today's Triple Play topics uh and before we get to that why don't we cue the opening laughs all right that's fun okay so today's topic uh with Ben Sherman is biodata AI and workflows and I called it the Triple Play because it seems like those are three kind of separate unrelated topics but I know in some way you and Ben are going to share with us how they're related of course if you think about those three the one that jumps into my mind and I think most people are engaged in on a daily basis is AI ai's everywhere I I just went to the city last weekend and got a hamburger made by a robot uh so no human hands touched this hamburger and it was actually very good um I think one of the most common uses of AI is Siri or Alexa uh that's becoming more and more prevalent one of the uh the applications of AI That's near and dear to my heart is learning as you know I'm a big digital learning guy and AI has impacted learning in a variety of different ways somewhat you know some around conversational chat Bots doing coaching uh during The Learning Experience assessment you can have ai facial recognition or AI you know transcribe uh some video feedback and score it believe it or not of course AI curation something you've used in your classes the idea that AI goes out using machine learning crowdsourcing Big Data finds the best content delivers it to the Learners in real time so let's start with learning and then we'll move on to science what are some of your favorite applications of AI and learning definitely this this content curation is is a really cool thing and it's it seems very early to me even though it's very effective like you get really neat content but if we can get AI to the point where it starts to generate the quizzes and and generate labs and things like that I think it's going to be even more effective but you know one thing I look at AI is like it's a very small amount of code that you can write and you can learn how to code it in a very short period of time and it gives you this like tool this power this new set of eyes to look at data and find patterns in data like visually but also like statistically and it's really very much common sense it works kind of like our it's designed for our brain the little we know about our brain it's very functional it's just anything you can do uh from my perspective you know learning or or science if you can see a pattern you can get AI to learn that pattern I mean AI is driving cars right yeah you know it's funny but funny you should mention that one because that's exactly what was going through my mind I took that crazy uh Andrew ing AI course uh and you know got my AI certification uh which means very little uh but in that they talked a lot about how AI drives cars and the pattern recognition and and you know I'm scared to death obviously to drive in a self-driving car but I imagine that over time it's going to get better and better uh and I know um AI one thing I learned from that class is AI relies on Big Data there has to be a lot of data the more data the better the more data the smarter the AI so I imagine that's where the bio data piece of today's episode comes in is it's AI uh working on chewing on large data sets from biosciences is that right yeah or you could say large data sets which is totally 
accurate like large for the computer from their perspective but also just the complex systems I mean there's you're trying to see the weather and you're trying to measure the weather and you can't you know use a piece of paper in a slide rule anymore you have to have computers and AI to be able to process it and it's a it's a it's a real Quantum Leap because when I first started using the lab and I'm not a computer scientist I wasn't trained and all that kind of stuff we had I work with people um that do that but I was able to see patterns in my data just by visualizing it like trying to separate how different things are from all their differences and just seeing things like tumors and in normal tissues and different types of tumors and how progress the tumor is you can see it in the data so it helps you like go from like I have no idea what the data I'm looking at because it's so big to I can see there's differences so explaining those differences kind of goes back into science hypothesis testing and also kind of it really one of my favorite Tools in science is common sense and like these things are different what is it you know just trying to figure it out that's the Alabama enemy yeah you know it's interesting when you're talking I'm thinking about the self-driving car and one of the key things there is being able to I you know scan the environment and identify patterns you know what's a tree and what's a human and what's a road um and I imagine that same concept applies when you're you know scanning big data sets biological systems and identifying different cell types and and and looking for patterns pattern recognition yeah I mean actually um so uh Our Guest uh Ben Sherman he's somebody I met years ago and he was in a lab where they were doing self-driving cars a bunch of computer engineers and as a lab that I work with the person for a long time the pi the main principal principal investigator of the lab and I walked into the lab one day and they were at Grand Theft Auto playing on a screen which is this video game right where you go and do all sorts of stuff and kids you should you have to be like you should be have to be 40 plus to be able to play this thing but I know everybody's going to be playing it but they had a camera watching the the video of a car driving and they actually were using AI to control like the controller like the handheld controller you would use in the video game they bypassed that and they were controlling the car with the camera and and this is you know back when I first met um met Ben and some other people in the lab if a month maybe a few months later one of the students in the lab said well we there's a way to trick the self-driving car to where it doesn't know that there's you know empty road in front of it so it just veers to the right and the left and crashes and it's something that some people have developed to be able to hack into some of these artificial intelligence models and he's like well could we do that with normal and tumor data like where the normal data has been tricked into being tumor and just do the exact same thing where you treat the data like an image instead of like you know an image of what's in front of you and then we did that we published a paper on it and we're actually publishing multiple papers on it it's a it's a very um practical Common Sense type of of science that I could never do without these tools that I I guess I understand about halfway yeah but well our guest a great transition because our guest Ben Sherman 
understands them a little more than halfway so uh let's welcome Ben to the show hey Ben thank you so much for joining today we've got Ben Sherman here who's uh excited to talk about biohacking and and specifically for today's episode bio data Ai and workflows so before we get to those three and what they mean and how they work together Alex you want to do a quick introduction of our friend Ben uh yeah Ben is a a double renaissance man I've known him for several years um going back to when he was an undergrad and uh he's he's an amazing person that's able to do hardcore software engineering and artificial intelligence work but also he's a great people person and can talk complex science with people and translate that patiently which is sort of the key property that Ben would agree with that um and so just uh amazing things and so he's he really is uh you know I look at Ben as kind of like a third quarter of the 20 21st century type of scientist oh wow well said um Ben your rebuttal to that well that was a wonderful introduction um a lot better than I was expecting so normally he would just berate me for not being more productive in the labs that's right we love Alex that way so you've been you've had some pretty exciting uh activities this year I think in the last 12 months if you want to talk about you know what you've been up to because I think you did the trivecta here all in one year yeah yeah so in the last year or so I I got married my wife graduated I graduated defended my PhD uh right as I was doing that I I was also interviewing for jobs and so I found a job around the same time um and then my wife and I you know we'd both grown up in South Carolina um but we decided to move out to Austin Texas because my job is remote so we moved out to Austin Texas at the beginning of this year and we've been there for we've been here for the past five or six months or so and um that's about all I've been laser focused on is like man we did all that I'm ready to just uh relax for a bit and just enjoy the ride for a little while before moving on to something else um and uh in previous years I'm also uh a musician that's probably my main thing aside from just uh being a being a software guy um but PhD and pandemic kind of um threw a wrench in that it's it's hard it was hard to to play with people when the pandemic was going on and and that's really the funnest part about it is when you have other people to play with um a lot like science it's very collaborative in nature um and and also I just had to focus on other things you know with a PhD so um I've been that's been on break for a while but I'm hoping to get back into it once we've maybe settled down a little bit more here you know met some people found some I don't know found some jazz clubs to go to or something people keep telling me that there's a huge music scene in Austin I just haven't uh haven't explored it yet I'll tell you what I've been there plenty of times I know some musicians uh who absolutely love the place uh you know they talk about Nashville you know being the the center point of music I think Austin's probably the the center point of live music uh and uh there are lots of clubs and lots of great opportunities for just about everybody there's open night mics all open mic nights uh all the time uh oh you are gonna love Austin and as you were talking I I pretty much figured that's why you moved to Austin I'm sure that entered into your uh into your thinking I'm sure that was a factor yeah so um so the question that that's 
burning in my mind and maybe you can talk a little bit about you know your story and how a software engineer gets into Ai and then you know bioscience and how you and Alex kind of ended up together uh because it's really a cool story and I know this paper you're working on I want to talk a little bit about uh the the cool use of gaming AI um in in cancer research but um tell us a little bit about your story and how you and Alex uh hooked up how you got into biosciences yeah it's it's a fun little story you know I went to Clemson for my undergrad um like I said grew up in South Carolina so it was um it was an obvious choice um and then I stayed for my masters because I found a lab that I would enjoy working in and it and this this was uh Dr Melissa Smith's lab she was doing a lot of work in high performance Computing and and also starting to get into AI um and around the time I started Alex came to our lab and basically just presented on what he was doing he was like hey here's all these different projects we're working on um we're always looking for people like y'all to come and help out with the software and Engineering aspects and um I really oh it's my buddy Colin um for for a lot of things really he was the one who got me into the lab and then he was also the one who was like who was actually like hey let's go and talk to feltis about this um and so I went with him and that was sort of how we got into the lab um you know and so we did that out for about a year and then after about a year you know I was planning to do a masters I decided to switch into a PhD um and honestly the main reason was that I wanted to go get hired by Nvidia so we had a lot of guys in our lab who had graduated and moved on to Nvidia and they actually gave us a tour one time of one of nvidia's fancy new buildings they built this giant triangular building and it was just really cool and they were telling me about all the different work they were doing and it's like man I really want to work for this company and it seems like they hire a lot of phds so I think we'll have to get a PhD and then maybe they'll hire me and I knew at the same time that you know I had been working with felthus so I knew that there was going to be plenty of stuff to work on right there was going to be plenty of interesting projects I wouldn't have to be do like the the typical like PhD nightmare where you're sort of Meandering for six years looking for a dissertation topic um I think that the topic I ended up doing I think Alex and I first talked about that within the first year or two of my PhD um and then that was sort of remained the dream and the goal um from then on and so yeah I stayed for my PhD um I think I ended up finishing it as quickly as I intended um about four years uh four and a half years I think code would probably delayed me uh by about a semester that and getting married that was awesome and that's kind of important yeah we uh we finished it and I did actually get an interview with Nvidia and got an offer from them but around the same time um I also got an offer from this other company called Sakura Labs they're basically a startup around a a certain open source project called Next flow which we had been using in our lab um and it was not something that I really thought would um ever actually happen but the the funny thing was is that they um the the topic that we settled on for my PhD topic um was actually something they wanted to implement in their product and so basically once I got my first paper out on that topic I 
just emailed it to him just kind of it was kind of a long shot honestly I was like hey I know you guys want to build this out I'm actually working on this why don't you uh hire me and I'll I'll do it for you I'll build it for you just just kind of out of the blue um I hadn't really spoken to him that much and they were actually like yeah let's do it um and so I end up going to work for them um basically a dream job you know you get to actually uh apply your PhD work into a business context and um you know work on next flow which was a very a very beloved software project that we used in the lab and so I'm just I'm just having a good time man just hanging out in Texas working on next flow playing in the jazz clubs I love it yeah yeah living the life man you're the Renaissance Man uh okay so talk about next low talk about the role I mean for me the you know we talk about biodata Ai and and and workflows and and earlier Alex and I talked a little bit about AI everybody knows about Ai and of course you know data uh is the fuel for AI but people I don't think fully appreciate the value of workflows and and you know the magic that happens next flow being as you describe the open source application of that so talk a little bit about workflows and next flow and and the role they play in bioscience yeah it's honestly it's hard to put into words like the the the experience of discovering next flow and it's not just next flow it's it's workflow managers in general of which next flow is is one of the more popular ones but there's this like uh there's there's this thing that happens when your computational scientist and you start learning about all these different analyzes and tools that you want to use and usually what happens at first is that you just write a script that change all these tools together um and it can be a really big hassle um trying to get these these giant scripts to run on super computers or especially in the cloud like forget about the cloud like that's yeah that's just not going to happen in this situation right and then you come across a tool like next flow um and there's others like snake make and biomake and wdl and all these other ones but just next one particular we'll focus on that where it's like here's a language where you can Define um hear all the different different steps I want to run in my analysis here is how all the steps depend on each other right so you know you run step a and then you run step B and C and they both depend on a um but they're independent so you can run them in parallel you know try try implementing that into bash script right yeah right you got D and C no so on and so forth um and you just Define all that you just define basically like it's like a sewer system like a pipeline um and then you just you just say next flow run and you give it your your laptop or you give it your a super computer or a cloud platform and next flow just starts it just starts running it's like an engine um and everything it just runs um usually not the first time usually takes a couple times to get it right but then you get that thing working and it's it's just beautiful um I I really think that next flow and other workflow managers like it are really the the next evolution of of how we write code um because the way we currently write code is is basically like a recipe right like just a single um sequence of instructions like do this and then this and then this and then this we call it imperative programming um imagine that you had a giant recipe um that was very 
complicated and you had it all written out one step at a time and then you had four people come in the room and say hey we want to help you you know with this giant uh recipe that you want to cook um is it immediately clear how you can split up the work it's not at all right because you just have everything in one line what if you actually had a flow chart describing sort of not the data flow but the food flow the ingredient flow like okay you start out with these ingredients and then you you do this operation and now you've got some bread right and then now that bread goes through that and then you have this flow chart well now it's actually very clear how much parallelism you can get uh out of out of that and so now you know like okay if I have four people I can split them up in this way and that's sort of what workflow managers do for code is that it allows you to describe the inherent parallelism in your analysis and then that way you can like I said you can give it a laptop or a supercomputer or Cloud platform whether you've got one core eight core or a million cores CPU cores and it just automatically scales um to the compute capacity that you have um I'm having trouble even just describing all of the different ways that it revolutionized you're doing a great job that's probably the main thing um just the expressiveness and sort of the automatic parallelism um and my hope is that honestly that this this Paradigm of programming can actually be expanded to all levels of coding um not just like these very high level analysis writing but even when you're writing um you know the tools themselves when you're writing in C plus plus or Java or python whatever language that you're doing that we actually develop ways to describe code more like flow charts rather than recipes um and that I think will both be a lot easier a lot more accessible for people who aren't um experienced programmers um because it allows you to sort of think more like a human rather than a computer and then it will also be better for the hardware that we're using is that as the hardware gets faster as it scales out more our code will scale up with it so yeah that's all I'll say for now do you could you uh what's your opinion on like how hard it is because like writing code like means is that's like an impossibility for some people that never done it and you know that has different uh meanings I guess for different people but like a workflow manager next flow for example like the it's writing code right but it's not like is it harder than writing like a application or is it more intuitive I love the recipe concept too and that's basically what you're doing is ordered events yeah I mean I think the ideal is that it's it's much more declarative basically which is why I think of it as being um much more human friendly than machine friendly because when you write imperative code like traditional code you literally have to think like a computer you have to be like okay if I'm a computer running this program I'm going to do this and this and this and here's all the different variables that you're changing it's very difficult even for experienced programmers but when you're using a workflow language the difference is that it becomes declarative so you're not really writing up a recipe you're more like describing like here is how this thing flows to the next thing it's it's very similar to like if you were describing like a physical structure um um saying like Okay you know like describing like a I don't know like a a building or 
a a network of Roads or something like like okay this road goes through this way and then these two roads connect here and stuff like that it's it's just like that you're just describing um this task and then the next tasks and the nest task and here's how the data is flowing through all of that um and it's much closer to if you were to ask a scientist like okay um here's this big analysis that you do all the time how does the analysis work they would say okay well you start with this first step and then you go to the next thing and so on and it's a lot close to that than writing literal code um I hope I answered that question yeah I jumped into that so you've worked with a lot of different fields right like you've worked with me in biology and other other fields like do do you find that you probably work with a lot of people too since you've gotten your job at that but do people know what they're doing a lot of times like like it seems like I this is something that really the workflow manager helped me understand what we're doing like I don't add no parts of the recipe and I have to go look at different pieces and throw it together and make it happen but then like now I know where I can optimize and things like that do you find that people or as dumb as me me knowing what they're doing or inefficient or just I don't know ignorant of what their workflows are on the computer you know in my experience it seems like the path that most people take like the path that leads most people to tools like next flow is is like I said earlier they the first natural instinct is to write a bash script because that's sort of the first thing that you learn when you're learning how to use you know like Linux and the command line and things like that is like if you want to automate something just write a bash script um and you know we don't really teach workflow languages as like a first order like tool yet yeah and so people sort of wallow around with in bashland for a while until one day maybe a colleague tells them like hey if you're the next flow have you heard of snake make um and then they discover and they have that aha moment that I was describing earlier like whoa this is like the next level um kind of work um yeah I I would say that I suppose that um the declarative styles of programming and really functional programming is really what it is um tends to be a lot more stricter syntax wise but it's usually a good thing because what it means it's there's more guard rails there's less ways for you to sort of screw yourself over um so you know a lot of times when you say you're trying to to write a pipeline you know write up the pipeline script um you know it'll it'll force you to really think about um what your analysis is you know what are the dependencies um between uh everything and uh and if if you if you don't think through those things and it just won't work like not it's not like it'll run and have like these silent failures that happen so often in normal code it's like right no your script isn't valid um you've got a problem here you need to fix this um before you even run anything right so in that sense I think it's it is more helpful for um for for Sciences for you know not um you know non-technical sorts of scientists um I will say though in the short term next one can be difficult because it's written in a in a in a what I think is an obscure programming language called groovy which is a uh it's like a scripting language based on Java um it is very powerful it's just a lot not a lot of 
people know it so a lot of times people come to learn next flow and they have to learn groovy at the same time whereas if it was written in you know something like python you know a lot of people are more familiar with python or even R you know a lot of people use R um and so there is a learning curve related to that um but overall I think it is a a very worthwhile investment you know you don't I'm not saying you have to learn next flow but I'm saying learn pick a workflow manager and learn how to use it um learning a workflow manager these days as a scientist is as important as like you know learning a scripting language you know it's going to be an especially important thing and to be able to tap into that like you know in my lab everybody's got to have know this on some level at least send some programming to get out and into biology but uh I mean do you how how accessible is this to people that don't know any coding like do they do they need to learn how to use the command line because I mean there's so many people that you know are used to swiping left and right on their phone I guess that's like a dating app kind of thing right or whatever but I'm old um but do you how much how much how how much training and and I'm sort of like loading the question because I have a strong opinion on this play how accessible is this stuff to people I mean there's no no reason to learn this if you're not ever going to use it but when you're doing science and research you pretty much have to use you know big computers now to process big data sets but you know is it how do you get over the it's impossible I can't do it I always think of like I can't do math I remember people when I was a kid saying I could never do math and that's sort of they can do math if they just do it yeah a lot of times it comes down to you know how you're taught people learn different ways there's there's a lot of really cool ways to learn math that probably don't get taught often enough that's kind of how I feel about workflow languages honestly yeah um I guess there's two levels to it on the one hand um if you want to learn how to write your own workflows then yeah you definitely um need to learn how to use the command line um in part because a lot of the concepts of workflow languages actually derive from how a lot of the practices that we use in the command line you know that really is the starting point for for lots of especially bioinformaticists right is the first thing you learn is like how to you know cat a file or how to how to filter it from different lines and search it things like that um so yeah I think that's a very essential thing um knowing how to like uh build C plus applications with with make and how to install these different libraries and get all that stuff to work I don't think that's necessary you know I think the uh the the tools should be very easy to install for people and nextflow is very easy to install all you really need is Java and then you're good to go yeah but but then beyond that using the actual command line I think is very important for that aspect there's another level to it which is that um there there is a set of a subset of scientists who who will never have to write their own pipelines they will only ever run pipelines that other people have written um so for example next flow we there we have a community called NF core which has basically been building out all of these sorts of standard bioinformatics pipelines that people use for example there's one called rna-seq that's just like 
sort of a Swiss army knife right RNA stick analysis A lot of people um you know they're not going to need to roll their own rna-seq workflow they can just use the NF core standard workflow and there's like a million ways to configure it so very likely it can fit your use case um and so the important thing there then becomes just all the things around that so you know how do you load up your input data how do you manage your your compute infrastructure actually run the pipeline and then get the output data from wherever it was produced and that's a big part of what Sakara Labs is doing with next flow Tower is sort of creating a basically web application that brings all that together so that for users that aren't writing their own pipelines that are just running workflows and just need the inputs and the outputs just making that as easy as possible so you have this one location where you hook up your compute environment your storage both for input and output and then you know you you can just you basically tell Tower like hey run this next flow pipeline here's here's where it's located in GitHub or gitlab or wherever run that and then you know here's here's a cloud provider that you can send all the jobs to here's an S3 bucket or whatever Object Store button or whatever that you can throw all the output data into and then when they're done you know they can just download it or they could you know do whatever you know they could spin up a Jupiter Notebook on top of that data and do whatever other analysis they want to do for them um it that it's possible for them to do all that without writing any code right because everything is sort of prepared for them um and so obviously that's not everybody like some of us have to write the workflows and a lot of people need their own custom analyzes and so in that case they have to pull into those those deeper skill sets um but you know for the people who just need to run stuff I think we can create sort of a happy path for them so do you think there's a lot of a lot of people like that because you know I always find that something breaks a little bit and then you have to have so at least some technical skills to be able to go and edit a text file right to make it work and even to the point of like real basic stuff which is very challenging for new newbie people in this area is just like how to find where the file is on the file system and sometimes you can just go and remap you know where that goes and like you know I don't know I just I just know that there's this I'm just this is one thing that practiced Ai and and I've been trying to do for years if even before practice the guys just enable people to be able to troubleshoot because that seems to be you know 80 of what we do when we're doing bioinformatics and computational biology research it's not like we're sifting through the results good or bad it's like we got to get results right the recipe you know we're missing an ingredient we gotta like replace it yeah that is sort of the hard reality these days of running these workflows um is that you know as long as next flow has been around and as many people have worked on it I still think it's very immature like I think most workflow managers out there are still very immature just because of how um I mean one of the biggest things is that you know next flow you can run almost anywhere which means that you know each one for every new place where you can run next flow there's there's a new Vector for ways that things can go wrong and break yeah there's 
sort of this black hole effect of you know people so I've talked about how you know people start out with bash they discover next flow they get super excited about it they start writing their own pipelines they start having issues usually related to like hey I'm trying to run it on my University cluster and there's these weird behaviors with it and so they get involved in the next little Community right and they'll they'll submit tickets and all this stuff and you know sometimes people have encountered their error before and they they say hey here's how you fix it sometimes they're the first one and so they're like okay well if I don't fix this myself um nothing's gonna happen and so slowly you slowly get dragged deeper and deeper into the community um both in terms of like knowing how to debug workflows knowing how nexo Works internally and then there's sort of the ultimate level is where you're actually contributing code to next flow itself this was this was my path I tried for the longest time to get things to work without ever having to contribute to nexo myself because I knew that if I had to download this this giant code base written in groovy and learn how to use groovy in Java again and get all these tools to work that would just be a whole other thing and I just I did not have the bandwidth to do that right because as you know felvis I'm already writing these like multi-node GPU applications and writing python code and jupyter notebooks and all this stuff I don't have the space to get into next flow contributing but then finally at the end of my PhD that's what I had to do because it's like well it's my PhD on the line now so I better make this work I better go learn groovy so that that's what happens you know a lot of people maybe may start out just as like basic users but as they get drawn into the sort of the next flow black hole and again you know it's not just next flow it's it's every workflow manager out there has its own black hole you slowly get drawn into like out just out of necessity yeah learning um how to diagnose things and how to fix things so um yeah there's that too I think anybody who gets into this field should be prepared for for you know going down that path so I'm I'm a little bit curious then then where's the magic right so you know we talk about bio data Ai and workflows and we started off with workflows and now I think everybody has a much better understanding I know I do about you know the critical role they play um is the AI you know is the Magic in you know is the AI in the in the workflow manager in the way that the data is being analyzed is it the the the the um as as feltsus would say the hypothesis or the experiment itself you know what data you're looking at I mean the workflow is a mechanism um but is there intelligence in that mechanism or is the AI and the bio data you know somewhere else yeah I would say that AI is is a totally separate uh technology that we use aside from the workflow managers okay so it's not thankfully yes thankfully there's not really any AI involved in workflow managers because that would be a nightmare I think you know workflows are complicated enough as it is yeah but they are certainly related um in many ways so with AI um at least from my perspective having been like you know a software machine learning engineer who joined the feltis lab basically seeing you know the different kinds of uh scientific questions that students that these students in the filter slab are trying to answer and basically seeing like okay well 
I've used these AI tools like I've trained neural networks to like classify different types of images um this was like the first thing we thought of immediately was like well why don't we just take that same image classification Network and apply it to uh genomic data just classifying different types of genomes right and we immediately saw like okay well there seem to be different types of cancers out there they they have different genomic profiles um maybe we can train a classifier to tell the difference between them and that that part was actually really easy I mean that was that happened almost immediately it's like yeah it's actually very easy to distinguish them but then the next question became like okay what what are the factors that are causing these different tumors to be different can we find like specific you know subsets of genes that are like you know accurately describing the differences between these tumors maybe those genes um are are the reason why the tumor is developing in the first place you know maybe that's that's a potential pathway not not a 100 answer but it's like you know if you can narrow yourself down from 50 000 genes to ten genes to explore well that makes a scientist's job very easy so it just it immediately when we joined the lab when Colin and I joined the lab it's sort of open there was these floodgates that opened of like all these different sorts of questions that we could explore using AI tools um that's that's the main way so using AI tools um basically in the same way that you would use traditional statistical tools to try and extract insights from your raw genomic data except now these AI tools can handle a lot more data um maybe handle more complex questions but ultimately you're still kind of using them in the same way um and then uh there's also I guess the other side of it is um well not the other side but just the other thing that we would explore in terms of AI was using AI to um to Aid in the actual execution of these analyzes so now we get to the relationship between AIS and workflows is that well you're running these workflows and you need you know this is so again here was sort of the Gateway question was like okay we need compute resources we need CPUs and memory and storage and time to execute these workflows um but we really have no idea how much we need you know it's like okay well here's an input file that's 10 gigabytes um how long will that take for my analysis well I have no idea um so what we did was like okay well let's just run some analyzes a bunch of times and collect the actual performance data that the usage data and then sort of use the pass to predict the future um and so that was that was the first foray of AI with workflows and and again there's sort of there's another pair of floodgates there with like okay well you can use AI to um you know to recommend resource settings or maybe to detect if and out if an analysis went wrong somewhere or even to optimize like the scheduling of resources so you know if you have you know a university cluster and a cloud platform and three different Cloud platforms you know maybe your AI can sort of schedule your workflows across your different compute resources in an optimal way there's all sorts of questions like that that also open up so I would say those are the two main connections that you had with AI there's two different aspects yeah I think that one thing to add with that just as a practitioner of this is that the data sites are huge and they have to go into big computers the 
computers now a lot of them are distributed they're not like one like a laptop or everything's done on one computer they're networked together like campus computers in the cloud computer and so you need this kind of you need workflow managers to be able to throw jobs at these resources that can be very expensive especially if you're using like the commercial Cloud providers where you can with one experiment one genetic experiment I could I could give me five minutes and I'll design a genetics experiment that I can spend five million dollars on in a day and then a lot of times you do I'd say if you're lucky like five percent of your experiments are useful to go to the next step they're all useful on some level and so I mean think about like if I were a small lab in a university in South Carolina that you know if I all of a sudden I get a billion dollar bill you know at the end of the year without having any management of my resources and AI can do that management then you know I'm going to get fired I'm not going to be able to do any any science and so with is talking about being able to have AI Control the flow of data on finite resources the data sets get ginormous I mean you know yeah I I bet you I'll probably retire in 20 years and I bet you will be heading towards exabytes of data which is you know there's only a handful of people that can even do those experiments right now even with all the sophisticated text so we can't just brute force it right Ben would you agree I mean part of this is is adding finesse and I I like what you said that this is real early right it's not the Finesse isn't baked in as deeply as it could be to be able to scale up this is yeah when you scale up it's unbelievable the patterns you see you know if you look at one the toenail of the elephant and the Oasis you see something but then you look at the whole Oasis and maybe all the oacs you know in in North Africa and then it's like wow what you discover yeah for sure so tell us a little bit about some of the other are you working with other kinds of science I know you know you've worked with Alex and you know biosciences um I imagine this would have an application in you know biochemistry or chemistry or physics or any any other kinds of applications of this workflow technology and AI yeah so when I was doing my PhD um the feltics lab and Life Sciences was definitely the the biggest piece of the pie by far there was a couple of other labs that I collaborated with a little bit like a material science was the main one but but that one even that one we weren't really using workflow managers that was more just about writing GPU code so just sort of the some more like low level optimizations I think Life Sciences was the only one I worked in that was really using workflow managers um I don't know that the other science Fields have have really um gotten into this space yet or really discovered the secret you know I've heard that there are some workflow managers that people use in like Linguistics which kind of makes sense because you know it's also like large text Data um a lot of the more like physical science Sciences like physics and and the kind of stuff that you know typically runs in the doe Labs I think they still they still have their traditional methods you know just like write a giant MPI application very like rigid parallel structure and then you know run it across a giant supercomputer that has a very regular structure to it as well um and and for them that's important because they want to get you know 
every ounce of performance that they can out of their um out of their Computing it may very well happen that in the future they will also converge on workflow managers like these the only other Big Field that really uses workflow managers I would say is the sort of like um I don't really know what the best word for it is it's just like it might maybe it's data engineering or just ad Tech really all the big tech companies that you know have these these hugely popular websites and they have a lot of machine learning workloads around like analyzing user engagement or like content so you know like YouTube with videos or Netflix with movies and shows you know they're doing all sorts of machine learning work you know analyzes on on the data that they have you know I heard a joke the other day that you know that Netflix is a uh is is described as a basically a metrics generator that occasionally streams movies yeah so a lot of these tech companies they're generating just so much data out of just just so much like metadata you know aside from the actual content itself just the data about how users are using their Tech um they're using machine learning workloads all the time and they've they have their own they have their whole their own whole world of workflow managers entirely separate from the ones that were developed in the life science community and what I'm what I'm really interested to see is if there is ever any crossover between these two worlds because you could totally use next flow to run you know like some random machine learning workloaded Facebook or Google or something like that that has nothing to do with life science data the workflow technology is exactly the same now it might not be your preferred language you know you might prefer to use Python instead of groovy and in fact most of the machine learning workflow managers out there are written in Python um but you know those technical issues aside it really is the same technology and there is a lot of potential for crossover I haven't experienced that yet I'm hoping to you know I would love it if Sakura Labs could expand Beyond just life sciences and actually get to you know basically anyone here like if you need to run a workflow you know come to us because we're your guy we can do it for you all in one place yeah there is a common denominator with you know image analytics I mean anybody I saw I learned this from working with people like you that you know you can use image analytics techniques in biology and it just if you think of things as if you can convert into an image with you know number of pixels and density and color and stuff like that if you can use that analogy when you're doing it it makes things really good but I've noticed that there are there's a little bit of crossover between astronomy and biology actually some some meetings that happen together because they have the same data storage problems distributed compute problems and it's a lot of its image analytics because when you're doing DNA sequencing the first thing you do is take a bunch of pictures of Indian technology and then translate that into atgs and C's of DNA it's like it seems like you could you know if you you have an image analytics workflow that'd be very easy to Port across disciplines if people are aware of it talking right because these communities just don't talk to each other a lot but I have seen it happen and it's pretty exciting to think about you know when you start translating workflows between communities and then you know it worked and 
what else can we do and that's when I think I've learned from to interdisciplinary work it's just amazing what happens you don't know what's going to happen you just know it's going to be awesome yeah yeah this is why I think that you know workflow managers are not just going to change How We Do Science it's really going to change how we code so really yeah really any any application domain that involves coding I think is you know over the next decade 1500 years is going to be transformed by this sort of new paradigm of how to write code because you have to in order to keep up with the amount of compute that is available um you have to be able to write code in a way that can actually you know utilize all that so so go ahead Alex I was gonna so what would what would you do if you're when if you're coming out with a bachelor's degree and you wanted to do computational you know I don't know if I want to Define it too much like if you want to use computers to do stuff maybe start a business or do science or whatever like what would you rank as like you know the top you know two or three skills I mean do you go through a computer science curriculum and you know do you learn learn some things maybe they don't teach in computer science like workflow managers like how would you net knowing the wisdom that you have like if you want to train yourself to be yourself again like at the undergraduate getting a bachelor's degree level what what do you need to do what do you need to know what skills do you need yeah I suppose there's there's multiple ways to get to it these days you know you ask about like what major I mean you could probably do it through a computer science major or through you know more of a hard science major where you sort of pick up the pieces as you go but in terms of being able to like be a biohacker I guess like yeah yeah probably the first thing I would say is just Learn Python yeah like that's absolutely the first thing just Learn Python even if you never end up using it um it's just a great language for learning things and very likely you will end up using it because it's it's got tons of good libraries and um people call the Zen of python there's just a sort of a piece that comes over you when you run that's just me you definitely have a piece it's that jazz and then uh live in the job and then like learning Linux in the command line um learning how to use um just all these there's just again there's like a Swiss army knife of these little command line tools for manipulating text files and things like that that's a whole sort of skill unto itself um I guess learning to use Google and stack Overflow that's honestly the way most people learn things is like you you want to write something in code and maybe you already know a little python but you're like how do I do this specific thing in Python there's a 90 chance that if you just Google that you'll find a stack Overflow page with a uh an answer with a green check mark and it's like well that there's how you do it now of course the the Eternal caveat is that don't just copy code that you don't understand oh yeah what you can do is you look at someone's answer and make sure you understand what they're saying and a lot of times it is very straightforward and it is just like okay take this code snippet and apply it to whatever you're doing so that's a huge part of it really is like not just knowing stuff but knowing how to find stuff um yeah understand the internet is like this giant sea of of information like a giant like I don't 
know like a like a like a uh a gene correlation Network you know you're just you're just swimming through all these different nodes from one website to the next trying like okay how do I find the information that I need knowing how to navigate the web um is extremely important to that as well um and then yeah learning a workflow manager which again um not necessarily something you have to learn right from the get-go but maybe once you get to the point where you're you're you're you're trying to maybe when you're learning about actual analyzes that you want to run whether it's like how to train and use a machine learning model or you're actually learning about different ways to process genomic data maybe once you've got to that point then and you and you start to understand sort of the the multi-step and parallel nature of what you're trying to do and then just sort of keeping workflow managers in the back of your head saying like Okay when I get to this point remember this and understand that the workflow manager and the pipeline is how you express those kinds of things and then you know the diff the nodes within the pipeline will be you know Python scripts or whatever you whatever you want um I suppose uh so that's sort of the the coding aspect and then in terms of like the environment aspects so I've talked about Linux and then I guess there would be I think Jupiter lab is is probably sort of a universal tool that everybody would benefit from knowing um and just understanding how you can spin up a Jupiter lab instance anywhere you know you should be able to spin something spin it up on your University cluster or in the cloud somewhere and basic and there are even websites that'll just give you one for free practice AI yeah there's there's one for you yeah um you know so so now you've got it gives you the best of both worlds right because you can get these interactive notebooks so learn yeah and that's another thing I guess is learning how to use notebooks but then you can also still pull up a terminal you can browse all your files all that stuff is still there yeah so um it's very versatile in that sense um you know learning how to use like a like a HPC scheduler or how to use the cloud like on a very low level um I would consider those more advanced topics so get into those if you want but be on the lookout for sort of more abstract ways of doing that so you know maybe you maybe you land at a a bioinformatics startup or a big Pharma company who uses next low Tower and it's like okay well then you don't actually have to worry about how to like spin up an ec2 instance on Amazon you just got to know how to like log into Tower and then you know set up your compute environment which is just going through a menu and then running pipelines but knowing some kind of abstract way to use the cloud um whether it's through Tower or some other sort of abstract tool that I think also be very useful and then beyond that I mean I would say that's all the sort of like uh the the generic skills and then beyond that it's just sort of the the domain specific skills you need right so what types of data sets are you working with and what types of analyzes are you going to be using a lot familiarizing yourselves extremely well with those things especially especially your data like the data structure um you can never like understand your data too much you know there's always a benefit to making sure that you understand the nature of your data how it is being produced what are the different sources of error you 
know all that all those kinds of things it's a lot yeah no it's exciting you know I was checking the boxes in my mind as uh as Ben was talking Alex you know that all those things are in your Journeys I mean you know the Praxis AI biohacker Journeys a lot of the biohackathons you know are going through you know each of those skills and and those were also cataloged in the uh the biohacker credentials the digital badges so um in wrapping up Ben we always like to ask you know you're one of the original biohackers uh thank you very much for joining the community what is a biohacker to you um I I would say that biohacking is is a big tent so there probably isn't just one sort of specific profile of what makes a biometer like for example I I don't consider myself like a scientist like like a bioinformaticist or a biologist or any of those things I never really like I took 9th grade biology and uh that was it so I I really don't have a background in those domains or any other science domain for that matter um I'm just a software guy a workflow guy an AI guy um and I I partner with other domain scientists to to get things working that doesn't mean that a domain scientist can't learn these programming tools I think you know they should uh for all as much as they can but there is and and that so so perhaps that leads us to saying that there's sort of two complementary roles right there's like the people who are primarily domain scientists and they have the domain knowledge and they also have as much technical knowledge as they have the bandwidth to get right and you know the more they get the more power to them but ultimately there also is a second roll of the primarily technical people the people who know how to write workflows know how to apply different AI tools and reason about them and even people who are or people who are skilled in infrastructure so like how to knowing how to provision resources in the cloud and or on your University cluster things like that yep um and these people support the domain scientists and you know maybe we'll reach a future where you know we are actually Obsolete and it's actually possible to be both a scientist and have all the tools that you need but for now at least um and I'm very glad for it um there are there are both roles and there's sort of a yin and yang to it and they work together and and together we're able to accomplish way more than either of us would uh individually so it's a very it's a very beautiful and and very useful Arrangement yes kind of kind of like a jazz uh combo um so yeah that's that's a really good yeah I mean one musician can play some good music but you get some people together in the interdisciplinary Synergy start to set in right absolutely yeah there's no comparison that's the magic yeah the magic is in the collaboration and the synchronicity between the different players and and people playing different roles um and uh I'll I'll end on uh something that you know Reed said in one of our previous podcasts when he was talking about a tuning mind and uh he said that you know when band members come together and they're playing a particular song or a particular piece of music their brain waves synchronize uh because of the oral you know feedback Etc uh so I think in many ways when biohackers are working together both you know for the from the purposes of domain science and the computer science they're synchronizing their brain waves so that uh you know one plus one equals 19 uh instead of two so uh great segue thank you so much ben it 
was a wonderful podcast, we really appreciated hearing from you, it was riveting, and I will see you around. Do the closing song.
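For listeners who want to try the workflow ideas from the conversation, here is a minimal Nextflow (DSL2) sketch of the fan-out Ben describes in the workflow-managers segment: step A produces an output, steps B and C both depend on A but not on each other, so Nextflow runs them in parallel without any extra effort from you. The process names and shell commands are placeholders, not a real analysis; save it as main.nf and launch it with `nextflow run main.nf`.

```groovy
// Minimal Nextflow DSL2 sketch: A feeds B and C, which run in parallel.
// Process names and commands are placeholders, not a real analysis.

nextflow.enable.dsl = 2

process STEP_A {
    output:
    path 'a.txt'

    script:
    """
    echo "output of A" > a.txt
    """
}

process STEP_B {
    input:
    path a

    output:
    path 'b.txt'

    script:
    """
    cat $a > b.txt && echo "B done" >> b.txt
    """
}

process STEP_C {
    input:
    path a

    output:
    path 'c.txt'

    script:
    """
    cat $a > c.txt && echo "C done" >> c.txt
    """
}

workflow {
    a = STEP_A()
    STEP_B(a)   // B and C each read A's output...
    STEP_C(a)   // ...and run concurrently once A finishes
}
```

Because the dependencies are declared rather than implied by the order of a script, the same pipeline can be handed to a laptop, a university cluster, or a cloud platform and scale to whatever cores are available.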
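Ben also mentions the Groovy learning curve. In practice, most of the Groovy that Nextflow users write is small closures passed to channel operators, along the lines of the fragment below (the sample names are made up); it runs as part of a .nf script.

```groovy
// A taste of the Groovy-flavoured channel operators used in Nextflow scripts.
// The channel contents are made-up sample names.

Channel
    .of('sample1', 'sample2', 'sample3')
    .map { id -> id.toUpperCase() }                     // a Groovy closure, like a lambda
    .filter { it.endsWith('1') || it.endsWith('3') }    // 'it' is the implicit parameter
    .view { "queued for analysis: $it" }
```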
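On the "onramp to the cloud" point: where a pipeline runs is configuration, not code. A nextflow.config sketch along these lines lets the same script target the local machine or AWS Batch just by switching the -profile flag; the queue, bucket, and region values here are hypothetical placeholders. The same mechanism is what lets people run community pipelines such as nf-core/rnaseq on their own infrastructure without writing a workflow themselves.

```groovy
// nextflow.config sketch: one pipeline, two compute targets.
// Queue, bucket, and region values are made-up placeholders.

profiles {

    standard {
        process.executor = 'local'              // run everything on this machine
    }

    aws {
        process.executor = 'awsbatch'           // hand each task to AWS Batch
        process.queue    = 'my-batch-queue'     // hypothetical Batch queue
        workDir          = 's3://my-bucket/work'   // hypothetical S3 work directory
        aws.region       = 'us-east-1'
    }
}
```

Launching against the cloud is then just `nextflow run main.nf -profile aws`, which is the kind of switch that Nextflow Tower wraps in a web interface for people who only need to run pipelines, not write them.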
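Finally, on the troubleshooting and resource-sizing theme: per-process directives are the everyday, non-AI way of handling the "how much memory will this take?" question. A common pattern is to start with a modest request and retry with more memory if the task fails, which is a plain heuristic rather than the learned resource prediction Ben describes; the process name and numbers below are illustrative only.

```groovy
// Resource directives with a retry heuristic: request a little,
// and ask for more on each retry if the task fails.
// Process name and numbers are illustrative only.

process ALIGN_READS {
    cpus   4
    time   '4h'
    memory { 8.GB * task.attempt }   // 8 GB first try, 16 GB on retry, 24 GB after that
    errorStrategy 'retry'            // resubmit the task if it fails
    maxRetries 2

    input:
    path reads

    script:
    """
    echo "aligning $reads with ${task.cpus} CPUs and ${task.memory}"
    """
}
```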
