Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heron, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you're missing out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

Can't wait to see you there. I will say we are done with our housekeeping and we can go ahead and introduce our featured leader today, Samia Baig, Senior Data Scientist slash Data Engineer at Johnson & Johnson Innovative Medicine. Samia, would you like to introduce yourself? Tell us a little bit about what you do at J&J Innovative Medicine and also something you like to do for fun.

Alright, hi everyone. Thanks for joining. This is actually my first hangout and also my first time speaking. So looking forward, I've been a part of the R community now for eight years, so I know that this is a fun community and I'm happy to be here. As Libby said, I work as a Senior Data Scientist at J&J. I've been here for two years, but I have overall eight years of experience in different areas of healthcare and public health.

But what I've done, I would say, like throughout my career has been, you know, even though technically I am a data scientist by title, I kind of am more a little bit, I would say, a product or analytics engineer. So I sit in between, you know, like pure engineering team and then the business. And really like what my work has been throughout the last eight years, it kind of didn't start that way. I was, you know, started more in the clinical research space, but basically I, you know, heavily focused on making, you know, raw data ready for analysis. And then earlier in my career, I also was just working from like the whole like end-to-end, you know, preparing the data and, you know, developing like the end output needed for, you know, whoever needed the end product.

But right now, currently in my role, I focus almost entirely on just like the data, like, you know, making the data, you know, like ready for either a dashboard or a similar interactive product. I will be actually kind of shifting gears now to prepare data more for like ML AI use cases. But yeah, I think the principles have been overall the same. It's just like, you know, really just understanding the business context of the data and then also applying the technical skills and the right ones for the, you know, right use or, you know, the right audience to just, you know, like streamline the process and, you know, automate when needed. And then, yeah, just, I guess, coming up with creative solutions to make data easily digestible for a wide audience.

Analytics engineering vs. data science

Amazing. And I saw Rob say in the chat, he said, I feel like analytics engineer is a better descriptor for other titles, surprisingly often. Do you feel like that's a good descriptor for what you do, like analytics engineering, that underside stuff? I always need more context in engineering. I feel like it's a big blind spot for me. I am always so much on the modeling or the end user side of it.

Yeah, analytics engineering, I think is an appropriate way to describe it. I know that like, you know, within data engineering, like there's a lot of like, you know, discussion on how to actually name the titles or like, you know, if you fall under data engineer, but I think within it, you know, there's different stratifications and roles. So yeah, I would definitely say that mine aligns more with analytics engineer. But again, like more so recently has been like, where I've had that specific title, because before I was, you know, working more in biostatistics or, and like data science in public health. So in that I was just kind of like, you know, either one person or one of the few people on a team kind of, we were doing a little bit of everything. So yeah, so I think it's just been like, you know, different, different kinds of roles throughout, but I would say overall, like, you know, yeah, analytics engineer describes pretty well, like what I'm doing now, and what I focused on a lot throughout my career.

Background: from pharmacy to data

I wanted to ask you a little bit about your background, your education, and kind of how you got to where you are, were you much more focused on the subject matter, and sort of found yourself in the data space doing data type things over time? Or was data something that you were focusing on, even when you were in school on a really, you know, regular basis?

Yeah, that's a great question. So actually, I'm a pharmacist by training, I do get a lot of questions about that. I went to a six year pharmacy program. So that was like, pretty much right after high school. And so I think it was a pretty big, like, it felt like I knew I wanted to work in healthcare. And I knew that whatever I wanted to focus on was, you know, related to like clinical sciences, what exactly I wanted to do, that's a hard decision to make when you're a high school student. But I got into the program.

And, you know, I knew that I really enjoyed the clinical knowledge and aspects of what I learned there. But I wanted to learn more skills, because I knew that in healthcare, there's just so many things. But I think sometimes when you're going through an accelerated program like that, you don't have too much time to explore the different career paths. So what I did was that after I graduated, I went and did my master's of public health, specifically focusing on epidemiology and biostatistics. So in public health, that is more of the quantitative, analytical skill set. I kind of knew that I would go into the data direction.

And I was just trying to learn more skills, because I wanted to diversify what I could learn, to see how that could synthesize together and how I could apply that in healthcare. So before I went into my public health program, I was trying to learn how to code. But something I realized later is that when I was first trying to learn to code, I was learning to code before I was learning to think programmatically, and I do think there's a distinction between the two.

Because I think initially, like, I didn't understand, when I was first starting out, like, you know, like, why am I like, you know, telling, you know, the command line to make a directory, but I can just do that with the user interface. Like, I never really understood why I was doing things writing code, you know, until later, much later in my career. So like, in public health school, I was much more focused on statistics. And I thought I would become, you know, a clinical scientist or, you know, someone who was doing more research.

And then when I started my career, and I think this is something I hear a lot from people with any kind of academic or research background, is that when you go into the real world, the data is nothing like it was in the classroom, because it's very chaotic in a way that you could never anticipate. I mean, there's just so many different things I've seen with different data sources. And a lot of my time really went into preparing the data and making it ready for analysis.

And then there were so many skills that I didn't realize after in my career that were not as much emphasized, I think, in school, because school focused more on, like, you know, the, the more methodology and the statistical analysis, which definitely has its place. But I think, I think the real world, you know, you definitely see, like, more important aspects of, like, you know, version control, you know, documentation, you know, making your processes, like, reproducible, and scalable.

And that's really where, when I had mentioned that before I, when I had first started to learn to code, I was just learning to code, learning to type syntax, but not realizing what it meant to be programmatic. And I think that's kind of something I think about, especially in, like, this, you know, like, new AI age, where we're, like, thinking about, like, oh, like, you know, is coding going to be as useful or not by hand. But I think really, like, the main thing that stays is that really thinking of, like, okay, like, how do I design my pipeline so that I'm not, like, doing the same thing again and again? You know, how, you know, do I, like, make sure that, like, when I'm, you know, starting out a project that I'm, like, you know, initializing with the Git repository, or, like, you know, why is that even important?

And I think really, like, my eight years of experience has been just learning that, and also learning that in different settings. So, even though I was within, like, you know, the healthcare, you know, overall healthcare domain area, I've worked in academia, a hospital pharmacy, public health department, and now currently in pharma. And there was this tweet, I think I read once, that said that, like, you know, sometimes when you start a new job, it's almost, like, starting, like, a new season. Like, if your life was, like, a TV show, it's, like, starting, like, a new season. And even though I work within the healthcare realm, I definitely felt that, because, like, all the problems, all the data, they're all just, you know, different from each other, and that by itself is, like, a learning process.

So, really, yeah, I mean, I think that's really how I got, I would say that I kind of started with data in mind, but I didn't really think that this would be the direction my career was going in until I started, you know, working. And I think even, yeah, even with the years of experience I have now, I mean, there's always, like, you know, trying to figure out, like, is this where I want to be? Or, like, you know, like, what are other aspects of data that I can learn? Because there's a lot of, like, you know, niche areas within data as a broad field. So, I think there's always a lot to learn.

There is. It's so validating hearing you say that first principles were really important, because as a person who teaches other people to code, it's so, so difficult to say, like, oh, well, you know, just keep doing it and keep being messy and keep making mistakes, because over the course of the next three years, you know, you're really going to understand why this or that is valuable. I really think that that time is needed to understand why those things are important. You can't skip through and just understand intuitively why it's important to use version control until you've maybe messed up several times and not used version control.

And that giving yourself the space to spend that time is really, really important. So, it's wonderful hearing you say that. And I see some things in the chat, too, that are very complimentary. Grace says, love your thoughts on this. Started teaching reproducible workflows and version control in first year at UBC, but personally didn't learn that until my PhD. That is very, very valid and common, I think.

Transition from public health to pharma

And as an aside, I know that you said you had worked in Department of Health. That was New York City, right? You worked in the New York City Department of Health. And I think we had a question. We had a question that was an anonymous one that said, what was it like in that move that you made from public health to pharma? And maybe that could also include what did that job process, job search process look like? And did all of your experience transfer really, really well? But you were already a certified pharmacist by that point.

Yeah. So, I did feel that when I was going into public health, especially working in the Department of Health, that besides my pharmacy background, having the public health skillset was important. Because in pharmacy, we learn the clinical research skills at a very high level. But it was in my public health program that I took the much more nuanced epidemiology and biostatistics courses. And I felt that in a lot of the public health job descriptions, it was highly emphasized to have a very specific public health background.

But there's a lot of jobs within the health department. And I mean, I think even as new administrations come, there are new priorities. So, that always, I think, also influences the type of jobs that are available at any given time. And also, I think the Department of Health has evolved too. After I left, they also have started, I think, more of a data science group within or a data science team. And I worked specifically more under a healthcare policy team. So, it wasn't specific to data science, but we had a team of people who were data scientists, as well as PhD scientists. And we were all working on the research and evaluation aspect.

So, I think just preparing for, and I would say that the transition from public health to pharma, there were similarities and differences. I mean, I think the similarities were really just being able to work with different data sources within the healthcare domain, and understanding the stakeholder who needs that data, and then understanding just the process, and making a data set that's like, some of the data is like survey data, some of it was our own internal data, and really just trying to understand the ins and outs of what's needed to get the data readily available for the person who needs it. I think that's a skill that transferred in between all of my jobs.

But I think it was the overall domain knowledge of just healthcare and public health helped me get my job in the Department of Health, and then pharma as well. But pharma definitely is very different. And I think it's also, when I say pharma, I'm not speaking for every single team, obviously in my organization or other pharma companies, because I think it's just a very huge workplace. And I think there's definitely different teams who may be doing different things that I don't have full visibility about.

But I think I specifically work more on a data science team, which is a large team, and all of us are data scientists, but I think a lot of people come from different and diverse backgrounds. So similarly, I came from a pharmacy background with a public health degree, but then you do see some people who have a technical background, came from data science, or people who had more of the scientific background. So I think it's just that the type of data I use is really different. The type of questions I'm answering is very different. More so related to specific treatment or therapeutic areas, whereas in public health, it was more broadly looking at programs and initiatives within the city, still within the healthcare domain, but we were just answering a different question or doing something differently, because it was more like, I guess, public programming.

But yeah, I would say that there's both an overlap, but then I think maybe the differences just come in the type of work.

Perfect. I mean, that goes for all kinds of differences between industries and jobs too. Well, thank you, Anonymous Asker, for asking that question, and we have a whole bunch of questions in Slido now, so let's go get to them.

Data scientist vs. analytics engineer responsibilities

Noor, I saw you had asked a question. Would you be willing to unmute and ask that one live?

Hello. Can people hear me?

Yep.

Awesome. Okay. Hi, my name's Noor. It's for Samia, and forgive me if I'm messing up your name. Is Samia correct?

Yeah, that's correct.

Okay. As someone who works in both the data science and data engineering spaces, what do you say are distinguishing factors in the responsibilities of said roles? Because I know that often sometimes there's like some opaqueness between them, as well as what would you say is unique to, I guess, J&J as a data scientist or data engineer?

Sure. So I think that even though my title is data scientist, and then in the past I was working as a data scientist, and then sometimes it'll be more like my title was data analyst when I started my career, I definitely feel like in healthcare, especially when I started my career, because I think in the beginning there weren't as many distinctions between like data scientist and data engineer. I feel like more so recently within the last maybe like five years, I'm seeing much more of that difference, I think just because data scientists were doing kind of all of that work.

But I think the thing is that technically speaking, I think data scientists are considered, you know, the professionals who work specifically more on the modeling after the data engineering work has been completed. So data engineering, analytics engineering, technically speaking, is like an upstream responsibility. But I think like, and I am more so like, you know, in my workplace, like an analytics engineer. But then again, like I said, I can't really speak like, you know, for my entire organization, because I don't know, like, you know, in different teams, like, if there's that distinction or not.

But I think the thing is that in my past experiences, there weren't really like a distinction. And I think it also just depends on the company and like the setting of your role. Because like I mentioned earlier in my career, I was working more like, you know, kind of like a solo data person. And in like, you know, in the pharmacy setting, or in like Department of Health, I worked on a team. But then again, in academia, when I first started my career, I was working as kind of like a solo data person. So I think in that I wore many more hats.

And I think it's just the way any company's team is organized, or what specific data project they're working on and how that team is organized, that is probably how those roles are assigned. So I think in a smaller organization, maybe more of a startup type, there's probably one or two or a few people who are wearing many hats, whereas in a larger and more robust organization, on a bigger team, I think these differentiated roles, like data scientist, analytics engineer, product engineer, are a little bit more stratified.

Tools and languages

Thank you, Samia. That was a fantastic question. And I would love to follow it up with an anonymous question in Slido. That's perfect for our about halfway point where we are here, because usually by this point, someone has asked this, and here it is. So it is what are the most popular tools that you and your team use in your job at J&J? So languages, like, are you an R person, Python person? Do you all use SQL? Do you use transformation tools, data warehouses, cloud services, stuff like that? All that would be fantastic context.

Sure. So I would say in the two years I've been here, I've been using a lot more SQL than I ever did in my career, which is really good. I mean, throughout my time in data, I've heard a lot of emphasis on SQL continuing to be, in the long term, one of the most important languages. Technically it's not a programming language, but it's a language that I think really is the mainstay for any kind of data work.

Initially, it was a hard transition because I started my career almost exclusively working in R and Python. So it's a very different type of mindset. You know, I kind of think of SQL being like, you know, like those dolls, which they kind of go into each other, like, you know, like a little. A nesting doll. Yeah. So it's kind of like, that's how I think of like, like using SQL, but I've been using, that's been kind of my work because I work a lot more on, as I mentioned, the analytics engineering side, and, you know, we use a data warehouse, and then I use SQL to, you know, combine data from different sources and make it ready for like a dashboard or similar tool.

But I have used R and I do use Python time to time. And I think in my current job, I use, I would say I use Python a little bit more than R because I was developing more scripts or, you know, programs to do like more data checks or, you know, just like trying to automate like something, you know, like that I needed to like get the data, you know, ready or things like that. I, we use AWS currently at my job.

But in the past, um, yeah, in the past I had used, when I had used a lot more R, I was using a lot of different data sources, you know, like survey data, electronic health record data, and, you know, like medical claims data, things like that. So that was like, and we also didn't really have like a, you know, at that time didn't use AWS. It was a little bit more, um, I would say just kind of an arbitrary process, uh, to like, you know, get the data, um, you know, kind of ready, you know, developing these like R and I think that's really where R and Python, especially I use R a lot more became important because it was really like, you know, we were trying to pull data from like, um, like, like, you know, if it's like survey data or electronic health record data, um, you know, different sources.

And it was just, I think a little bit more of like a, uh, programmatic way in which like, you know, I, I get the data and then I'm like, you know, trying to like, you know, create like a Excel workbook out of that or create like a R markdown out of that. So at my job currently, I don't create the end product. So that's really where I think the SQL use, um, becomes more important. Um, but yeah, I mean, I think those are the, the, I would say like SQL AWS are like the main tools I currently use. And then Python for, um, you know, just some like, um, ancillary processes to, you know, like do QA's and things like that.

Perfect. Well, there was a, a question that is a nice little quick follow-up with this one, which was, what do you use SQL in? And I'm guessing that is like, what type of environment do you write SQL in? Um, this question says I've done some intro stuff, but I never run into much opportunity to use it in my day-to-day and I want to get better so I can apply to more jobs requiring SQL.

Yeah. So I currently use AWS Redshift and, um, I think the free tool that you can use, um, I mean, I, I know that some people are using, um, other tools, like more like open source tools. Um, but, uh, I can't really speak to that as much. The thing with SQL was that I, um, understand that, you know, kind of conundrum of trying to get, you know, like, cause I, I think like I was just R and Python has a much more friendly open source, uh, environment. And like, I think just being an RStudio user, um, and eventually having it evolve into, you know, Posit, um, I felt like there was more like, you know, this one-stop shop. So I wasn't really using SQL as much, um, to prepare like, you know, for, for like other roles in which SQL might be required.

And so really I learned SQL. I would say that, you know, if, if someone does have like a R or Python background and then you're going into SQL, I mean, it is a different language, but I think that's enough to prepare you, you know, to eventually use like SQL at the workplace. Cause that's really where I, I would say like, just, just, you know, doing SQL at my job was where I, um, became really good at it because like, you know, I would see exact, exactly the complexities, the problems and things where then, you know, I could think through it.

Cause sometimes I think when trying, that's also another thing, like, you know, when trying to learn something in your own time, um, you know, you're given these like practice problems and they're fine for like, you know, the introduction, but they don't necessarily, I think they're, they're not going to be as complex as you see in the real world and there, you know, you definitely will. I mean, like it's, I think that, that kind of concern that like, I'm not going to know enough. Um, I think if you already know the language, like that gap is usually filled and it's just kind of a matter of like a different syntax. And then also maybe just like learning, like one thing with SQL is also just learning like that, um, like CTEs or the sub query nesting, because that's what makes it, you know, different from like the Python and R mindset. Um, but I think like as long as you still have that same principle of, you know, analyzing data, um, I think that, um, that learning curve shouldn't be as big.

Making data pipelines more robust

Um, and it says one of the issues I'm having is convincing my team to make the data pipeline more robust. They are more content to leave it and then work through the panic when it fails. How would you go about convincing them of the need for a more robust pipeline?

Sure. Um, so I completely understand, like, I, I think what happens is, um, my understanding is the reason this happens is that I think, um, you know, sometimes like you'll get projects and then you want to get them out really quickly. And then, you know, you, you want to make sure you have something to show. And then like, you know, um, with whatever, however, the pipeline is, because I think the thing is sometimes making a pipeline more robust is, um, requires some upfront time. And I think it's really just trying to make people understand, like, that, um, I would say, like, maybe like that balance between how much time should that take? And then, like, you know, what's kind of, how would that kind of improve, um, the end result?

Because I think sometimes what happens is like in the moment, you know, trying to get things out quickly, just using whatever we have, trying not to fix the upstream processes, um, is something that is very, I would say like typical in like, you know, a lot of data science environments, at least like what I've seen. So I think the thing is like, I don't have like a very, I would say, concrete answer to this. Um, but I think really it's just, um, kind of starting kind of, I think, like, you know, being a, like, if you've already identified that is like identifying, um, kind of small but powerful places where you can fix things, um, you know, something that's like maybe not as time intensive, but then that can like really have an impact on making things, you know, better for the long run.

Um, I think some things are, for example, if there's not enough documentation, or you see a documentation gap, to really try to fill that. I personally feel like also in SQL, because I use that primarily at my workplace, especially if you have much more complex processes, it helps to really find a way to make the SQL easier to navigate. So I'm a very huge advocate for just using CTEs and not multiple nested subqueries. And in the whole grand scheme of making a more robust pipeline, that probably sounds like a small, minor thing, but it actually makes a huge difference, because down the line, if someone else needs to own the code, read it, or make a fix to it, nested subqueries are very, very difficult to debug.
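To make the CTE point concrete, here is a minimal sketch that writes the same query both ways and runs it through Python's built-in sqlite3 module. The `claims` table, its columns, and the data are invented for illustration; they are not from the speaker's work, and the speaker's own environment is AWS Redshift rather than SQLite.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claims (patient_id INTEGER, region TEXT, amount REAL);
    INSERT INTO claims VALUES
        (1, 'east', 100.0), (1, 'east', 50.0),
        (2, 'east', 400.0), (3, 'west', 75.0);
""")

# Nested version: the reader starts at the innermost query and works outward.
nested_sql = """
    SELECT region, AVG(total) AS avg_total
    FROM (
        SELECT patient_id, region, SUM(amount) AS total
        FROM claims
        GROUP BY patient_id, region
    ) AS per_patient
    GROUP BY region
    ORDER BY region
"""

# CTE version: each intermediate step has a name and reads top to bottom.
cte_sql = """
    WITH per_patient AS (
        SELECT patient_id, region, SUM(amount) AS total
        FROM claims
        GROUP BY patient_id, region
    ),
    per_region AS (
        SELECT region, AVG(total) AS avg_total
        FROM per_patient
        GROUP BY region
    )
    SELECT region, avg_total FROM per_region ORDER BY region
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
```

Both queries return the same rows. The difference is only in how easily someone new to the code can follow each step and test a single named piece on its own.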

Um, I don't know if that answers maybe like a specific aspect of like, you know, um, like in terms of like pipelines, because I think also like different companies have different processes of like, you know, how they develop the pipeline in the first place. So I think like it, it would be hard to completely like, um, and especially if you have, if you work in a bigger company to like totally do the redesign, cause there's so many people involved in it. But I think just thinking of what you can do in your role is really trying to find ways to, um, really like, um, you know, reduce, reduce anything you can in which would make, add more maybe technical debt in the long run or make, you know, the process harder, um, for like, you know, like the pipeline to be able to, you know, for like, you know, like, like if you need to transition it, or if you need to have other people work on it, um, I think just kind of identifying like small areas and, um, being able to tackle those.

Awesome. Thank you. And I saw somebody had commented on there. Can you define robust? Um, and maybe you can speak to this a little bit, but I think that I, I define robust in a data pipeline as hard to break. And if it does break easy to figure out what broke so you can fix it. So having logging, um, having logs and then also just making sure that some, at least some things are automated so that there's not this process where like, it's waiting for a person to go in and do something all the time. What do you think Samia?
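As a rough illustration of the "easy to figure out what broke" part of that definition, the sketch below wraps each pipeline step in a small logging helper using Python's standard logging module. The step functions and the `run_step` helper are hypothetical names made up for this example, not anything described on the call.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")

def run_step(name, func, data):
    """Run one pipeline step, logging its start, finish, and any failure."""
    logger.info("starting step: %s", name)
    try:
        result = func(data)
    except Exception:
        # logger.exception records the traceback, so the failing step is visible in the log.
        logger.exception("step failed: %s", name)
        raise
    logger.info("finished step: %s", name)
    return result

def drop_missing(rows):
    """Hypothetical cleaning step: drop rows with no value."""
    return [r for r in rows if r.get("value") is not None]

def total(rows):
    """Hypothetical summary step: add up the remaining values."""
    return sum(r["value"] for r in rows)

def run_pipeline(rows):
    cleaned = run_step("drop_missing", drop_missing, rows)
    return run_step("total", total, cleaned)

result = run_pipeline([{"value": 2}, {"value": None}, {"value": 5}])
```

When a step raises, the log names the step and includes the traceback before the error propagates, so nobody has to guess where the run stopped.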

Yeah, I definitely agree. I mean, I think that's like, you know, I think like just being, making it so that it's like a difficult, like, you know, I think like I gave in the SQL example, like, you know, the more you write a very complex query that only you can understand the more difficult it is for, you know, like debugging down the line. And I think like that's, um, at a very high level, like I, I do feel like that's how I got into like R and Python, um, to begin with, because, you know, a lot of processes when I started were like in SAS or were very fragmented. Like they were like, um, between SAS and then copying and pasting SQL and things like that.

And so I think like another thing is that, um, like what you said Libby, like, um, you know, just trying to identify ways in which you have to minimize, like someone going in the code and changing lines or someone having to do something. And I think like really like, um, in programming, like, you know, using like a language like R and Python, like, I think you can, you know, use more, there's more an opportunity to use placeholders or like, you know, variables that can, you know, replace like the dates, or like replace like the names of like, you know, because like, you'll be basically able to reuse that. So I think just like having that principle of like, reusability and scalability in mind.

Just like data science, where you're like, I'm gonna write a function for this instead of going in and changing my hard coded values 50 times. Right? Yeah, we're gonna make this parameterized, right? Yeah.
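The idea of replacing hard-coded dates and names with parameters can be sketched like this. The `visits` table, the `visits_between` function, and the sample data are assumptions made up for the example; the query uses sqlite3's `?` placeholders so the same SQL is reused for every date range and site.

```python
import sqlite3
from datetime import date

def visits_between(conn, start, end, site):
    """Count visits for one site with start <= visit_date < end.

    Dates and site are parameters, so nobody edits the SQL text between runs.
    """
    sql = """
        SELECT COUNT(*) FROM visits
        WHERE visit_date >= ? AND visit_date < ? AND site = ?
    """
    (count,) = conn.execute(sql, (start.isoformat(), end.isoformat(), site)).fetchone()
    return count

# Hypothetical data; dates are stored as ISO strings, which sort chronologically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visit_date TEXT, site TEXT);
    INSERT INTO visits VALUES
        ('2024-01-05', 'A'), ('2024-01-20', 'A'),
        ('2024-02-03', 'A'), ('2024-01-11', 'B');
""")

january_a = visits_between(conn, date(2024, 1, 1), date(2024, 2, 1), "A")
february_a = visits_between(conn, date(2024, 2, 1), date(2024, 3, 1), "A")
```

Running next month's numbers then means calling the function with new arguments rather than going into the code and changing lines.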

Thinking programmatically vs. just coding

Okay, perfect. I want to hop over to Amelia's question. It says, I think the distinction you make between thinking programmatically versus coding is so important. Getting past that "I could just make a folder with the GUI, why would I use code?" mindset is a challenge when teaching. Do you happen to have any specific resources or strategies or examples that might help students or others with that mindset?

That's a great question. And it's interesting to think about. Yeah, I think. So I think maybe I'm trying to think a little bit. Some of the best examples are just like, yeah, here's what I used to do. And here's the years of experience that taught me why that was so important. Right? I think the thing is, like, maybe an example you could do is like write a script, which like has to be developed every month or something like like, and with some kind of cadence, right? So like, it's like developed, like, every like, like, maybe like, especially if you're teaching a class, like, maybe have a week where they have to like, write a script that like runs every day. Because I think the thing is, then like, you know, and then like, that script should be saved in a specific folder with a name that distinguishes the date, something like that.

Because I think like, when I was initially learning, particularly in my statistics, you know, classes, the idea was that we're just writing these scripts to like, you know, run for specific analyses. Once, you know, we get the analysis done, the paper we publish and everything's great. But reality is that, you know, those same analyses will be run multiple times. And also, I mean, that depends on like, you know, if you're working in academia, versus like more of a practice setting, like public health, but like, you know, a lot of times, a lot of scripts need to be run, like, maybe periodically, or maybe like, you know, even if it's not on a cadence, like, you know, I need to run this in a couple of months, when I'm asked, I just should be able to, you know, press that, or just like, you know, basically be able to, yeah, like, open R, run that script, and then get the output.

And that's kind of it. So I think maybe like, in terms of teaching is like kind of developing, like, an automated pipeline, you know, like, creating a report, like, I would say, first start with the basics, like creating the report, making sure the report or whatever analysis does what you want it to. And then the last part is really like, you know, trying to, like, maybe use like Task Scheduler or cronR to like, run that report every day and have it sent to your email or have it saved to like a folder. And then I think that will really reinforce like, why, you know, those skills are really important in a lot of places, because I think a lot of people don't really see that aspect of like, you know, these analyses usually are not one and done, but they're usually happening again and again. And, you know, rather than me having to do the same thing again and again, where can I minimize that gap and just like, you know, do it so that it kind of runs more or less on its own.
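
The teaching exercise described above (a report that runs on a cadence and is saved under a name that distinguishes the date) could look roughly like the following Python sketch. The function name `run_daily_report`, the `daily_reports` folder, the `summary_<date>.csv` naming, and the sample numbers are all assumptions made for illustration. Scheduling itself is not shown, since that is left to an external tool such as the ones mentioned in the call.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path
from statistics import mean


def run_daily_report(values, output_dir, run_date=None):
    """Write a one-row summary CSV named after the run date and return its path."""
    # Default to today's date so a scheduler can call this with no arguments changed.
    run_date = run_date or date.today()
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The date in the file name keeps each run's output distinct.
    out_path = out_dir / f"summary_{run_date.isoformat()}.csv"
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run_date", "n", "mean"])
        writer.writerow([run_date.isoformat(), len(values), mean(values)])
    return out_path


if __name__ == "__main__":
    # Placeholder data and folder; a real pipeline would read from its own source.
    target = Path(tempfile.gettempdir()) / "daily_reports"
    print(f"Saved {run_daily_report([3, 5, 8], target)}")
```

Because the date is a parameter with a sensible default, the same script works both when a scheduler runs it daily and when someone reruns it by hand for a specific past date.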

That's the dream, right? That's the dream. I mean, I can think of four things that I've done in the past six months that I have thought, oh, I'm gonna have to do this again, but I don't have time to do it the right way this time. And I need to just do it. It's tough. Like, if anybody is out there going, oh, I would love to do that, but I don't have time to like, make sure that I don't have technical debt later. I see you, it's hard.

Non-data science skills

Are there any non-data science skills that are very helpful in your either current or former job?

Yeah, non-data science skills are, I would say, I mean, I think maybe data science skills technically should include some non-data science skills because it's not just all technical. You know, I think like one, like, and sometimes I think it's hard to define those skills because like, it's easy to say, hey, learn R, Python, or like, you know, maybe like use Denny. I don't know. Like, you know, I think it's much easier to kind of point to a tool, whereas the soft skills are a little bit more abstract. Definitely, you know, communication, but communication also is, I mean, a lot of people say communication, right? Like just communication, but that looks different, you know, but you have to kind of, I think, really understand your team, understand your stakeholders, and really kind of like put yourself in their shoes and think like, okay, what, like, you know, how are they thinking about this versus how am I thinking about this?

And then really, I think with that understanding, it also forms like, you know, good relationships, you know, also just, you know, maintaining a good communication channel to make sure that like, you know, people are in the loop with whatever's going on. And also, I guess maybe not making assumptions like, you know, like this person's a non-technical person, they won't know. Like, I think really just like, like I said, like, like just totally understanding, you know, what that person's understanding is and going off of that.

And one, like, concrete example I'll just give, like, in terms of, like, a non-tech, technically non, you know, data science skill is that, like, you know, a lot of people love dashboards. Like, and I think maybe, like, people assume that everyone, you know, whatever I will produce should be a dashboard. But not everyone is used to that. Like, I mean, there's a lot of people who just like a color-coded Excel sheet, a very, like, you know, maybe like a Word document or just something, you know, that's just sent, like, in a small, like, you know, HTML table. And I think it's really trying to understand what their needs are and what works for them by just putting yourself in their shoes or seeing, like, you know, what processes they're used to or how they have throughout time really, like, been taking in data insights and then being able to develop solutions based on that, I think is probably, I would say, one of the most important skills I've learned, you know, throughout.

Domain knowledge and cross-team visibility

My question is, and I guess it's not, I'll start with, you know, trying to understand exactly what subject matter you reside in. Is it pharmacovigilance with epidemiology or kind of where you sit in the organization? But then beyond that, my question is really, do you, is there any kind of community among data scientists across the whole company, right? Spanning potentially like discovery, CMC, non-clinical, clinical development, pharmacovigilance, you know, just to what extent are you connected with other data scientists across the organization? And to what extent do you kind of have visibility into who's doing what? So, you know, because ideally I think a lot of data science, you know, in different contexts, there's, you can learn from each other, even if you're working on different subject matter expertise.

Yeah. Yeah, I definitely agree with that. So I work in the data science and digital health team and we work in research and development. So I work more on solutions for, you know, different clinical trial stakeholders. But then like, you know, I also, you know, will be down the line involved in projects which will be kind of in the same lane, but just like different aspects of, you know, that work. So I'm still learning and, you know, maybe like my answer to that question is also that, I mean, you know, I just don't know, like, completely all of the different aspects of what's there, because I will also be getting involved in those different aspects.

But I think one thing I have learned in, you know, working in pharma, it's been two years now since I've been at my current role, is that particularly J&J is a very big organization. There's a lot of data analytics in many different aspects that I can't really speak to and don't have full visibility, but I'm trying to learn because I mean, sometimes I'll just see someone's like title or something pop up and I'll be like, hey, can we schedule a one on one, I'd love to learn more about what you're doing. And I think for me, that's really helpful to, you know, just have more visibility, particularly as someone who started here, like, you know, just a few years ago.

But yeah, but I mean, even when I worked in the Department of Health, like even though New York City Department of Health, it's just one city's department, there's like 6000 plus employees. And then, you know, we're just one agency, in like a broader scheme of other agencies. And the things I worked on were like, I was a little siloed, you know, from like other teams, and then it was sometimes just other teams who are just in a neighboring bureau or team who are doing very similar things. And I didn't always have full visibility on that.

But I think the thing is that like, and hopefully I'm answering this question correctly. But like, yeah, I mean, I think I do agree with, like, because sometimes I believe that it's very collaborative, like sometimes what happens in silos is that like, there are people doing some similar things, and then some different things, it's always really good to know, like, you know, all the different aspects of data that, you know, even I don't like fully know, just because I've never been exposed to someone, you know, working on a specific aspect.

In the Department of Health, how I actually got really involved in knowing what other people are doing was, I worked during the COVID-19 pandemic. And so in that situation, emergency situations, we got kind of regrouped into other data, other data teams, or other like, you know, groups, and focusing more on responding to the pandemic, rather than doing kind of our daily activities. So things like that, I think, helped a lot. And, you know, like, how I'm like, bringing that back in my current role is just really like, you know, you know, every time we have these like periodic meetings, or even if we don't just like kind of seeing, if I just kind of see someone like, you know, you know, like working on a specific area, like commercial or things like that, like, I'll just reach out to them.

But it's a little bit more up to me now to like, you know, make that awareness. And I think maybe like, from an organizational perspective, I mean, I think sometimes there are initiatives like, you know, to kind of do that. But I think in large organizations, I noticed that it's easier to get a little bit siloed, I think, just because of the volume of people working on different things.

Absolutely. Even, I mean, I work at Posit, which is a much, much smaller company than Johnson & Johnson. And it's really up to me to build my community as well. Just because we're all doing such disparate things. Gonzalo had said in the chat, like, making a community is not a spectator sport, you've got to take action. And I agree. I think it's really rare for companies to actually prioritize and have someone organize an internal community of practice. But nothing is stopping you from reaching out to a bunch of different people, including end users of your data products, and just getting to know them and shadowing them, finding out how they use what they use, what their pain points are, stuff like that.

Biggest data challenges

I mean, you work in all these areas, you're in J&J, a very big organization. You know, you've got data that you work with all the time, and then data sort of several steps removed from, you know, the things you deal with. So if you could speak to a little bit, what are the biggest data challenges you have? Is it internal versus external type data? And if you had a magic wand, what problem would you solve in that area?

That's a great question. Yeah, I think maybe I might have alluded to like, or maybe someone else might have asked a little bit in terms of like, you know, the pipeline robustness question. I think the thing is that sometimes like the issues with, and not just specific to any company, but just overall is that, you know, an idea comes, we want to do it quickly. Let's just do what we can. Like, it's almost like, you know, I think we're just kind of like, we're not, you know, focusing on the more upstream processes, but then just trying to leverage what we can downstream to get something out. And I think that it's kind of like sometimes what happens in data, I think that these can be a little bit like band-aid solutions, like we're just trying to get things out, but then we're not like, you know, seeing like, you know, if we fix that upstream problem that needs to be fixed, then this can, you know, significantly reduce the time I spend on three different projects to, you know, make that change within three different projects.

Or like, you know, if I just kind of like, you know, think more, and it kind of just, I would say, goes back to like that mindset of, you know, developing or really developing our pipelines in a way which we can really just minimize all of the fixes happening way downstream. Because then, like, you will just keep on making these individualized solutions downstream when some of those solutions could definitely be, like, you know, fixed upstream, and then it could scale to multiple projects, and we wouldn't, in the long run, be spending so much time doing that.

And I would say maybe, like, that's a kind of thing, like, you know, that's something I would love to maybe fix. I would say if I had a magic wand, if a magic wand can write documentation for me, that'd be amazing, because documentation takes a lot of time. Like, it's like one of those things which are kind of dry, but you know that, like, it's very important. And sometimes I enjoy documenting because it's almost like a teach-back to me, like, oh, this is all I've done, and this is what it is. But it can be very time-consuming. So, I think that if I had a magic wand to just document all the thoughts that are in my head the exact way I wanted it to, you know, just write that documentation, I would say I wish there was an easy way to do that. I know some people are using AI for that, but sometimes still you need to contextualize, like, exactly, like, the thing. So, I think with AI, like, maybe, you know, it would still be like a hybrid process in trying to, you know, create documentation.

Career advice: join a community

With our last five minutes, I would like to ask you a career advice question. Samia, what is a piece of career advice that maybe you wish you could go back in time and give yourself, like, a younger version of you when you were just getting started in data or maybe leaving school or something that you like to give to others?

Sure. In terms of career advice, so, I think for me, the biggest thing was just, like, joining a community. I mean, I'm not, like, saying this just because I'm in, like, you know, a data science hangout. Join this one. Being for real, though, because, like, I would say, like, a lot of my learning happened probably even outside of my work and, you know, in the community. Like, I joined TidyTuesday. That advice was actually given to me by someone when I worked in my first job because we did have a small but mighty R user group at my workplace, and I think just joining more groups, like, there's a really great, and I mean, I think everyone's already kind of there, but just, like, even, like, in person, if your city has one, those were really amazing. I met Rachel at the R user group when I was in Boston many years ago.

So, it's, like, I just really, like, you know, learned people, learned different things. I was a little, you know, intimidated, and sometimes, you know, I mean, I still am even all these years just, like, going to groups and being, like, oh, these people know so much and, like, you know, what do I, like, ask or what do I do, but I think the community, like, you know, I do notice that a lot of people very much are on the same page, no matter what industries we work in or, you know, whether we're starting out or years into this, like, more or less everyone's on the same page.

But I do feel, like, you know, like, joining these communities and particularly not just joining, like, you know, something that's more, like, a talk session, but, like, I think TidyTuesday, when I was doing that every Tuesday and then eventually I did, like, the 30-day map challenge, it was a very excellent way, like, really to just learn from other people. Like, I started feeling, like, I'm okay at ggplot and now feeling, like, really, really, like, excited and good, you know, about, like, the work I've done just because, like, you know, I, you know, just, like, other people shared their ideas and then, like, you know, I also started to, like, it kind of, you know, helped me think about, like, you know, my own unique perspectives and then really, like, you know, think creatively through R code and things like that.

And I think, you know, even though that was more specific to dataviz, but, like, it also just kind of helped, you know, me get better at writing better code, writing some R packages in my own time and really, like, you know, also just, like, re-emphasizing the importance of version control. So, I do highly recommend for anyone who is, like, you know, interested in, you know, open source languages to really try to join some of these, like, more practical, you know, like, projects or, like, social community-based projects because I, it really helped me get from a point where I was, like, okay, I feel somewhat okay, but, you know, I could do more to, like, more confident, and I think that definitely helps, like, you know, bridge some of that, bridge those skills.

That's fantastic, and you know, I 100% agree with that and want everybody to get involved in the community and share your work out loud. Thank you so much, Samia, and this is where I will plug, again, the Data Science Lab because we have so many things that have to do with what Samia just said coming up. March 10th, we're going to have Jon Harmon, who is the maintainer of the TidyTuesday GitHub repo. He's going to show us how to submit a TidyTuesday dataset, how to curate a TidyTuesday dataset, what happens behind the scenes, you know, when they get submitted and curated and how they end up on the TidyTuesday repo. And then the next week, we're going to have Joey Marshall, who's going to be live coding analysis of a TidyTuesday dataset, I think, using Claude Code, and this is the one where you'll see him use voice transcription.

So, if you're just curious about how other people work, you want to see some live coding, come join us at the Data Science Lab. It's where we share screens.