Spotlight: Lauren Maffeo (Author) Interview and AMA!


Author Spotlight:
Lauren Maffeo (@lmaffeo)

Businesses own more data than ever before, but it’s of no value if you don’t know how to use it. Data governance manages the people, processes, and strategy needed for deploying data projects to production. But doing it well is far from easy: Less than one fourth of business leaders say their organizations are data driven. We talked with Lauren Maffeo, author of Designing Data Governance from the Ground Up, about data governance, data life cycle management, and data ethics, among other topics. You’ll be surprised and informed by our conversation.

This is also an AMA. Everyone commenting or asking a question will automatically be entered into our draw to win a copy of Lauren’s ebook!


Hi Lauren! Please introduce yourself.

Hi, my name’s Lauren Maffeo. I am a service designer at Steampunk, which is a human-centered design firm based in the Washington, DC, area serving the federal government. My specialty is designing digital products and services for federal government agencies. I just wrote a book called Designing Data Governance from the Ground Up. It’s a 100-page, six-step guide to building a data-driven culture in your organization to help you automate your data governance standards into your production pipeline.

What is data governance, and what does it mean to use data governance in an organization?

When I talk about data governance, I’m really talking about the combination of people, processes, and tools used to help organizations manage data at scale and with quality.

Quality is a nebulous word, kind of like data governance is a nebulous topic.

When we talk about data quality in this context, we are talking about it in the context of answering the question, “Is this data fit for intended use?” So that means, “Can I trust the data? Has it been vetted? Is it beholden to particular standards, either at a policy level or at a more insular, organizational level?”

The reality is that many organizations today—most organizations today—can’t answer that question because they don’t have any standards by which they are measuring their data’s success.

“The reality is that many organizations today—most organizations today—don’t have any standards by which they are measuring their data’s success.”


We know more data is created on a daily basis today than ever before. That’s really what is driving this new wave of artificial intelligence (AI). AI itself is not necessarily new, but the proliferation of data is new.

The problem that we’re seeing at a high level is that there is more data than ever before but the vast majority of it is unusable for many reasons. The biggest reason is lack of quality, which goes back to lack of standards.

This means that there’s no ownership over data, and the people consuming it can’t always trust what they’re consuming. That results in a lot of wasted data that is not only unhelpful for businesses but can actually be a liability.

That’s another challenge I commonly see with organizations that have a lot of data: they hold that data in their possession longer than they should. They have no mechanisms for when to get rid of it or how to get rid of it.

Again, that creates a liability for organizations, especially if you do business in the European Union. That part of the world gives citizens a lot more rights over their personal data than the US currently does, although the tide is shifting a bit there.

So ultimately, like I said, data governance is the people, processes, and tools that empower organizations to manage their data at scale by co-creating standards across the organization.

What are some of the ways that organizations harness and use data, for anyone who isn’t familiar with the industry right now?

It really depends on the sector that you’re talking about. I would say, broadly speaking, there’s a lot of conversation in the media about artificial intelligence and being able to use AI techniques, like machine learning, to do all of these amazing things.

There’s also a lot of rhetoric discussing how AI will “replace humans” because it will be able to do jobs better than humans can. There’s a real misconception around that rhetoric, which I think is problematic. AI is basically data that is trained to perform very particular tasks. So for instance, you can use an AI technique called natural language processing to comb through large amounts of documents and find keywords or phrases. If trained correctly and trained on strong data that has been governed, AI can perform that task better and faster than most humans can.
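For illustration, here’s a minimal sketch of that kind of governed-keyword search using spaCy’s PhraseMatcher; the term list and document are hypothetical, and spaCy is just one of several libraries that can do this:

```python
# Minimal sketch: scan text for governed terms with spaCy's PhraseMatcher.
# The term list and sample document are hypothetical placeholders.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

terms = ["data governance", "data steward", "metadata"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("GOVERNED_TERMS", [nlp.make_doc(t) for t in terms])

doc = nlp("The Data Steward reviews metadata before each release.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "Data Steward", "metadata"
```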

But the challenge with that is, I think a lot of people in the conversation forget that in order for AI to really excel, you need high-quality data: data that has been vetted, that, again, is beholden to particular standards that fit your organization’s business needs. Without that, you are training AI and machine learning tools on bad data, and then you can’t be surprised when you get poor results.

That’s a big challenge that I see when folks are trying to implement advanced technologies in their organizations. For them to do that, they really need a certain level of data maturity that starts with governance. I would go so far as to say that if your organization is not adhering to data governance strategy and standards, you’re not ready for machine learning and AI and all of the benefits that using those technologies to perform particular tasks will bring.

“If your organization is not adhering to data governance strategy and standards, you’re not ready for machine learning and AI.”


What are some of the challenges involved in capturing data at scale?

I think there are a few that I see very commonly, especially with my clients. One is not knowing where the data lives.

I have come into several organizations where the stakeholders involved had a lot of data in their possession that they owned, that they collected. In one case, this client’s entire job was to disseminate data. But when my team, including the data architects and engineers I worked with, asked where that data lived, we couldn’t get straight answers.

Did it originate on someone’s laptop as a CSV? Was it on on-premise servers? Has it been migrated to a cloud environment, like AWS? We couldn’t get very basic answers to those questions, and of course when you are tasked with building out new pipelines and user interfaces, you can’t bring in the data if you don’t know where it lives in the first place.

“You can’t bring in the data if you don’t know where it lives.”


Many people in many organizations today cannot answer that fundamental question. That’s really a blocker. If you can’t find out where your data lives, you can’t create APIs or other technical solutions to address challenges, whether that’s integrating those servers and databases in a cohesive way or starting a migration and moving the data over to a new environment. There are, of course, challenges with migrations in themselves, but if you don’t know where the data is to begin with, you can’t solve the problem.

The other challenge I see is the perception of data ownership and governance as a top-down initiative that comes from IT and is owned by IT to the exclusion of everyone else. And that’s a problematic culture for a few reasons. One is that top-down governance rarely works in practice at scale because people fundamentally do not love being told what to do, especially about something that directly impacts them.

That top-down model where IT owns all company data might have worked in the past, but today, every person’s job in your average tech company involves data.

“That top-down model where IT owns all company data might have worked in the past, but today, every person’s job in your average tech company involves data.”


Whether it’s your marketing director, your VP of sales, your customer success manager—they all deal with data. They all have to make decisions about their areas of the business based on data. And cutting those people out of conversations about data governance and management creates silos that impede data maturity. So that’s another problematic thing I see: not involving enough colleagues in the organization to co-create data standards for their areas of the business, which the technical team can implement in the dev environment.

And then the third thing that I see very often is a real lack of documentation. I know that many people in tech do not enjoy writing documentation. That’s an understatement, I think, but it’s really essential for many reasons. Without documentation, there’s no record of what you’ve done, of where data lives, of who owns it.

Without that very basic information, what happens is that new team members or new consultants join your work and, rather than reviewing the documentation and then making strategic decisions about what to prioritize, they have to recreate a lot of that knowledge from scratch.

That’s a real blocker. At bare minimum, it wastes a lot of time. And then if there is documentation, it often gets buried in a SharePoint or a Google Drive rather than being stored in a consistent location, let’s say a Confluence page or a wiki that everyone in the organization uses.

So those are the three key areas that I see as being real blockers for companies that want to manage their data at scale.

What does it mean to have data stewardship? What are the elements of having data stewardship, and how do you build a team to support that?

Data stewardship really goes hand in hand with owning data and being able to make key decisions about that data. So what that means is that the data steward really is the point person for making those decisions about data within their particular domain.

When we talk about domains here, we’re talking about specific areas of the business that have data within them. Data stewardship really is data oversight within an organization, and a data steward is responsible for ensuring the quality of the data in their domain, including the metadata. That’s really important.

Metadata is commonly discussed as data about data. Let’s say that you have a book. The book in this instance is the data, and the metadata could be anything from the book’s author, the date it was published, the city it was published in, or the editor. Let’s say those four things are your metadata: You would use those four types of metadata to find the piece of data—a.k.a. the book—that you’re looking for.
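Here’s a minimal sketch of that book example as a metadata record; the field values are made up for illustration:

```python
# Minimal sketch: the book is the data; the fields describing it are the
# metadata. All values here are hypothetical.
from dataclasses import dataclass

@dataclass
class BookMetadata:
    author: str            # who wrote it
    publication_date: str  # when it was published
    publication_city: str  # where it was published
    editor: str            # who edited it

# Any of these four fields can be used to locate the data itself (the book).
record = BookMetadata(
    author="Jane Doe",
    publication_date="2021-06-01",
    publication_city="New York",
    editor="John Smith",
)
print(record.author)
```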

When we talk about data stewardship, we’re talking about someone who not only defines their data, they define the metadata for it as well. That’s the person, for instance, whom the data engineers would go to if they’re finding issues with data in the pipeline when they’re implementing it.

Let’s say they have a question about which metadata should be attached to particular pieces of data for sales. If the VP of sales is the data steward for sales data within that organization, they’re also the person data engineers would go to with key questions about data and its associated metadata.

This creates a culture of shared ownership across organizations that is really important because it gives everybody a voice in the room when it comes to the data in their domains. It promotes expertise of domain-specific data that lives outside IT. And that’s very important because you ultimately want to create shared ownership of data that transcends technical roles.

“This creates a culture of shared ownership across organizations that is really important because it gives everybody a voice in the room when it comes to the data in their domains.”

What are some of the challenges involved in maintaining data integrity, while dealing with all the pressures of government regulation, privacy, and so forth?

I see a few common challenges. One is that if you are mature enough to use advanced technologies, like AI and associated techniques like machine learning or natural language processing…

Let’s take machine learning as an example. If your organization is mature enough to use machine learning at your company for particular tasks, then by default your models are constantly going to ingest new data.

This means that if you do not have predefined standards of quality that all data should adhere to, then you run the risk of really degrading the quality of your models over time. That happens because the machine learning models are constantly ingesting new data to update their predictions and learn from those data.

Ultimately that model is finding patterns in the data to inform its results. If you do not have quality standards that your teams are checking against, then again, the risk of data drift and of falling out of adherence with those standards is very high.
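As a rough sketch of what predefined quality standards can look like in a pipeline, here is a hypothetical pre-ingestion check; the column names and thresholds are invented for illustration:

```python
# Minimal sketch: validate an incoming batch against predefined quality
# standards before it reaches a machine learning model. Column names and
# thresholds are hypothetical.
import pandas as pd

STANDARDS = {
    "max_null_rate": 0.05,           # no column may be more than 5% null
    "amount_range": (0.0, 10_000.0), # plausible range for the "amount" column
}

def batch_meets_standards(batch: pd.DataFrame) -> bool:
    if (batch.isna().mean() > STANDARDS["max_null_rate"]).any():
        return False  # too many missing values in some column
    lo, hi = STANDARDS["amount_range"]
    if not batch["amount"].between(lo, hi).all():
        return False  # out-of-range values suggest upstream drift or errors
    return True

batch = pd.DataFrame({"amount": [10.0, 250.0, 9_500.0]})
assert batch_meets_standards(batch)
```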

The other challenge is that many organizations worry about the time involved in being a data steward. Many people already feel like they’re tight on time, too much is being asked of them, and governing data, especially if they’re not trained to work with data, is too much of an ask.

That’s why I think it’s very important, when you’re selecting data stewards, to look at people who already have to manage data as part of their daily tasks. You want to find people who already do this work. That’s a really key part of data stewardship.

Data stewardship is not a role that you put out a req for. It is something where you look for people within your organization who are already managing key areas of data—your data domains, if you will—and who can then lead those broader conversations about data quality and integrity.

“[A data steward leads] conversations about data quality and integrity.”


Now you might do a scan of your organization and find that you do not have enough data stewards to cover all domains. In that case, you need to be really strategic about writing job descriptions, such as for a VP of sales or a director of customer success.

You need to write in the job description that this person will be expected to act as a data steward, to define the quality standards for the data in their domains, and be able to explain those quality decisions to the rest of the org.

In that sense, you would hire a data steward from outside the organization to perform a particular function in the organization that is inclusive of managing data standards.

How do you develop and introduce this kind of oversight? How do you build a data governance board?

Building a data governance board starts with defining who the right people are to serve on that board. Death by committee is a real thing, and my big fear with implementing data governance in an organization is that it can very quickly become yet another bureaucracy if there is not enough time and attention invested in getting the right people on board and aligning them around a vision they can execute.

Frankly, that’s the broad view many people have of data governance today, especially if they move in fast-paced organizations where they’re expected to try anything and everything quickly. Governance, if not done well, can become a hindrance that makes people resentful.

And so in order to get a data governance council off the ground, that really starts with not just having a charter for the council that explains what you want to achieve. It also involves being very strategic about having every data domain in your organization defined and then having a representative per domain on the council.

It also involves getting together a core committee of people who can do the lion’s share of writing the charter, defining what the council will work on, and figuring out processes for approval.

For instance, what if a new term needs to get added to the data catalog? What do people do? Who’s writing this documentation so that everybody knows how they can add that new data to the catalog? Who will actually add that data to the catalog? These are very key questions that need answers, and if you’re going to scale out data governance across the org, people need those answers.

They will not be receptive to your efforts if you tell them, “We have a data governance council, and we’re doing data governance,” yet you can’t direct them to any resources or processes for them to follow. You really owe it to your organization to build that out for them.
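Part of building that out can be as simple as making each catalog entry carry its own ownership and approval trail. A hypothetical sketch, not a prescribed schema:

```python
# Minimal sketch: a data catalog entry that records who proposed a term,
# who approved it, and when. Field names are illustrative.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CatalogTerm:
    name: str
    definition: str
    steward: str                       # domain owner responsible for the term
    proposed_by: str
    approved_by: Optional[str] = None  # filled in by the council on approval
    approved_on: Optional[date] = None

term = CatalogTerm(
    name="active_customer",
    definition="A customer with at least one purchase in the last 90 days.",
    steward="VP of Sales",
    proposed_by="jane.doe",
)

# The council (or its catalog subcommittee) records the approval:
term.approved_by = "data-governance-council"
term.approved_on = date(2023, 2, 1)
```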

I would argue that if you are a chief data officer or you are a senior data leader at your organization, that is the essence of your job. I think people still believe that the chief data officer role is largely that of a technical implementer.

That’s actually not the case at all. It’s really first and foremost a business strategy role that determines how data will be used, governed, and managed strategically across an organization. And so I often hear very senior leaders say, “This is too much work. We will do data governance once this project is finished.”

That is, frankly, the absolute wrong way to approach the challenge because you need to start with data governance before you involve anyone else in the organization in those efforts, and you need to start with data governance before you start choosing which priorities to pursue based on your business needs.

How does a data governance council develop a charter for data governance and a roadmap for how the data will be used within the umbrella of the organization?

I think the five key things that any leader of a data governance council should focus on involve defining a few things. The first is a purpose. You do need to explain why this group exists.

I guarantee people will be asking you why this group exists, so you need to explain that in one to two sentences, along with describing the council’s objective and the problems it will work on. For instance, if your data governance council’s role will involve approving all new data lakes or data warehouses—any data-focused tools—you need to say that. People need to know that the data governance council is where they will get this information.

You also need to define scope—what your data governance council will do and what they will achieve. This is also really helpful because by defining what your council will do, you also define what you won’t do. If someone comes to you with a particular request that’s out of scope, you can explain that.

You need to write down goals that align with the business strategy. And again, I see a disconnect here all the time: very senior data leaders who want to implement new data projects and go straight to the architecture without really asking not only who is going to be using this architecture but which business goals it will help achieve. So you need to define the goals.

You also need to discuss members of the council and their core responsibilities, along with defining what their particular roles will be. If your data governance council is large enough, you can also create subgroups or subcommittees who can work on very particular tasks. For instance, you could have a subcommittee whose role it is to review and approve data tools.

You could also have another subcommittee managing the data catalog, whose job could be to review and approve all prospective new terms that would go into the data catalog. That’s assuming you have enough people to fulfill those tasks.

If you don’t, the most important thing is to then define what the key data domains are in your organization and then appoint at least one leader of each domain to serve on the council. You also really should define what’s expected of attendees and what they should do regarding participation.

You need to describe and define the scope of the committee’s authority. Again, when you’re talking about responsibilities, you need to confirm in writing what the council has authority to do and what it does not have authority to do.

This is also where having a really strong sponsor comes in handy. I talk in the book about the concept of having a sponsor for your council who is someone even more senior than the council chair. That person not only gives resources to the council but can also advocate for the council when your members are not in the room.

This really is an effort that needs to be supported by the C-suite. If ultimately your most senior leadership is not giving the time, money, and resources to help fulfill data governance, your efforts will not succeed.

That is really a blocker to getting this work done, so in order to successfully execute the charter, you do need an executive sponsor, and I explain how to find the right one in the chapter of my book that discusses the data governance council.

You talked about domains. What are the typical domains that would need people to be stakeholders on your council?

That is going to vary by every organization. It really depends on each organization and the essence of what they do. So for instance, let’s just say that you’re an organization that works with soil science and soil scientists. You are going to be managing geospatial data that is beholden to specific government regulation about metadata and how that particular data should be used. That legislation is going to inform everything about how you manage that very specific type of data.

That data is also unique in the sense that there are typically precedents to follow. You basically have predefined standards that have worked for decades, if not centuries, that you must abide by, given that is industry best practice, and that is science best practice.

In that case, you typically have folks in that organization who serve in technical roles. They are familiar with data and metadata. They often build models for various purposes, and they have very specific legislation that they are expected to abide by, or else they will either receive a fine or they’re at risk of losing funding.

On the flip side, you can have a software startup whose business runs on data collected by third parties. That is a very different conversation about data domains because it really depends on the business model and what you’re trying to do: which data you are collecting and whom you’re collecting it on.

I mentioned earlier that European citizens have more rights to their personal data than Americans do, and that means that if you and your business are collecting data about anybody who lives in the European Union, you are beholden to legislation called GDPR, which gives a European citizen the right to request any data that you have on them.

You’re also required by law to explain why you collected that data on them, and if you cannot do that, you’re at risk of receiving a fine, which can be up to 4 percent of your annual global revenue. We see large tech companies like Google and Meta being fined and sued regularly at this point, and they have enough money to pay those fines without too much fallout to their bottom line.

A smaller organization—the vast majority of organizations—cannot afford to lose that much in fines, and so the real risk they run by not establishing data domains and managing that data through governance is ultimately going out of business.

If someone decides to sue or request that information on them, if you can’t answer those key questions, you are at risk of losing everything that you’ve worked for. So defining your domains varies very much by organization, but it is something that’s really critical to do.

“If someone decides to sue or request that information on them, if you can’t answer those key questions, you are at risk of losing everything that you’ve worked for.”
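To illustrate what answering those key questions can look like in practice, here is a hypothetical sketch of serving a subject access request from a registry that documents why each dataset was collected; every name and structure here is invented:

```python
# Minimal sketch: answer a data subject access request (DSAR) from a
# registry that records why each dataset was collected. All structures
# are hypothetical.
DATASET_REGISTRY = {
    "orders": "Fulfilling purchases (contractual necessity)",
    "newsletter": "Marketing emails (user consent, revocable)",
}

DATA_STORE = {
    "orders": [{"user_id": 42, "item": "book"}],
    "newsletter": [{"user_id": 42, "email": "user@example.com"}],
}

def subject_access_request(user_id: int) -> dict:
    """Return every record held on a user, plus the documented purpose."""
    report = {}
    for dataset, purpose in DATASET_REGISTRY.items():
        records = [r for r in DATA_STORE[dataset] if r["user_id"] == user_id]
        if records:
            report[dataset] = {"purpose": purpose, "records": records}
    return report

print(subject_access_request(42))
```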


Can you talk a little bit about data life cycle and how that plays a role in data governance, or do you automatically assume that data will be forever?

That’s a great question. I think many people today assume and treat data as if it is forever, and that is a huge problem I see.

There’s a lot of emphasis in the press on having data, as if the possession of large amounts of data in and of itself is a valuable thing. The reality is, not only is it not inherently valuable, it’s actually an inherent risk if you hang on to it for too long.

You could inadvertently possess data longer than you are legally allowed to, and that opens you up to liability concerns. Even if you’re not beholden to liability or legislative concerns, that’s expensive.

This data needs to be stored somewhere. I know we’re moving away from on-premise servers, but it needs to be stored in data centers or warehouses, and you pay for that. And so if you continue to hold on to data longer than you should, at bare minimum, you are wasting enormous amounts of money on data that you will never use for the most part, which is a huge problem.

If you have not defined, as part of your data governance, your company’s process for destroying data—we call that data destruction—the absence of a plan can result in a lot of lost money that your organization likely can’t afford. That’s why having standards in place for which data gets destroyed, when, and by whom is really important.
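Here’s a rough sketch of what such a standard can look like once written down: a retention period per data domain, plus a routine check that flags records due for destruction. Domain names and periods are hypothetical:

```python
# Minimal sketch: per-domain retention standards and a check that flags
# records due for destruction. Domains and periods are hypothetical.
from datetime import date, timedelta
from typing import Optional

RETENTION = {
    "marketing": timedelta(days=365),        # e.g., consent-based data
    "transactions": timedelta(days=7 * 365), # e.g., financial records
}

def due_for_destruction(domain: str, collected_on: date,
                        today: Optional[date] = None) -> bool:
    """Flag a record whose retention period has expired."""
    today = today or date.today()
    return today - collected_on > RETENTION[domain]

# A marketing record collected two years ago is overdue for destruction:
assert due_for_destruction("marketing", date(2021, 1, 1), today=date(2023, 2, 1))
```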

I mentioned earlier the example of machine learning and how those machine learning models constantly ingest new data to refine their results; that is a reality of how machine learning models work. But if you do not have people managing those pipelines, reviewing new data against your quality standards, and assessing when particular data has become out of date or biased, that’s where you run into real problems. We know, broadly speaking, that data quality is a huge concern. Data destruction, and the lack of process for it, is a huge contributor to that issue.

You just talked about the planned end of life for data. So from there, could you talk a little bit about how you plan for, minimize, and mitigate data loss and risk from third parties that might affect your organization?

Sure. And I understand this because this is something I work with clients on constantly. Especially if you are in an organization like the government, or in a healthcare organization, there are enormous risks with doing something like a migration because the risk of something like data loss has huge implications for users.

Very often they stick with the devil they know, because they just don’t want to risk losing very valuable data—a loss that could be a liability for them.

I think that doing this well involves a few key steps, and there is a whole chapter in my book on how to practice what I call governance-driven development, which is the nitty-gritty of how you embed your data governance standards into your development processes.

This part of the process is really important because data governance is often viewed as purely conceptual, as theoretical, as something that impedes progress. As far as I’m concerned, if you don’t automate your standards into your development processes, then there’s no point in having those standards to begin with.

“If you don’t automate your standards into your development processes, then there’s no point in having those standards to begin with.”


The chapter in my book on governance-driven development talks about particular ways that you can embed governance and your standards into not only your dev life cycle but also your processes.

I use a case study of a Netflix migration in that chapter to illustrate data governance and how you can embed it into your processes and really make it a core part of your development team’s culture. I think the case study of what Netflix did and how they managed that really gives a lot of key lessons that readers can take away.

One example that sticks out to me is how the leader of that team at Netflix gave his team members room to fail. The team members were using Kafka clusters for the first time, and that, by nature, opened them up to mistakes being made because they were using technology that was new on the market at the time, and they didn’t have a lot of experience with it.

By default, that meant that they were going to make mistakes. So he actually orchestrated, essentially, fire drills where he gave his dev team members the opportunity to experiment with Kafka clusters in a closed environment, where any mistakes made would not be deployed out to Netflix users.
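For a flavor of what such a drill can look like, here’s a minimal sketch that produces test events to a sandbox topic with the kafka-python client; the broker address and topic name are placeholders, and this is not Netflix’s actual setup:

```python
# Minimal sketch: exercise a Kafka producer against a sandbox broker so
# mistakes never reach production topics. Broker and topic names are
# placeholders for illustration.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # local sandbox cluster
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(3):
    producer.send("sandbox.playback-events", {"event_id": i, "drill": True})

producer.flush()  # confirm delivery before declaring the drill a success
```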

Ultimately the goal of any migration is for users to not know that you made a migration. If they do know, it’s largely because something went wrong, and so you never, of course, want that exposed.

“Ultimately the goal of any migration is for users to not know that you made a migration.”

If you want to get your team comfortable with migrations, with using new technologies, with using data for the first time, I think creating controlled environments where they can experiment is a huge part of the process. It also allows nontechnical stewards to practice using data. There’s nothing like hands-on experience, and ultimately you do need everyone working in these environments to some capacity.

That doesn’t mean your marketing director’s going to become a data engineer, but you do need to have clear, consistent governance standards available in a shared environment that everyone can access and use. I think creating controlled environments is an important way to do that.

I also think that inviting strategic partners into your development is a key aspect of this work. I write in Chapter 5 about how utilizing open source communities can be a great way to invite new partners into your governance. You can then contribute to a new, up-and-coming open source project while using the open source technology for your own organization.

Two ways that stick out to me are getting your team comfortable with using new technologies by giving them the space to fail and inviting the right partners into your processes.

Your Netflix example had a lot of runway before they went with their final solution. How can you design into your data governance planning ways to respond to these kinds of challenges? For example, when government standards and regulations might change?

I talk in the book about how it’s important to have a data steward whose role is to keep informed of new legislation about data that would affect the business. This is really critical, because we all know that the world of data is changing rapidly, both at the technical and policy levels.
Right now, American consumers have very few rights over their personal data, but we do see a lot of conversations at the federal and state levels about how and why that should change, and you need someone on your team to monitor that.

I find that it’s very helpful to involve an attorney in your data stewardship processes. If this person is of counsel, it’s their job to look out for the organization, to make sure the business meets all legal requirements, and that they’re safe from litigation.

That person is an excellent data steward. They can make your data governance council chair aware of any upcoming legislation that your organization should know about. This colleague is likely not a data scientist or a data engineer, but they don’t have to be.

They do need to have a voice in the room when conversations come up about which data you’re going to use for which particular purpose, and why it’s going to be used, because if those key questions cannot be answered, you likely shouldn’t be using that data.

If the answers are going to change, or they have to change due to upcoming legislation, there needs to be a voice in the room making everyone aware of those changes. That’s why having somebody on the council in a stewardship role to manage legislative and ethics requirements is important, and I think someone with a legal background is perfectly positioned to serve this role.

Are there common, underlying philosophies or moral standards that should play a role in laying out a data stewardship plan, or does that vary by individual organizations? How does an ethics board play into this?

There is a lot of variance in data governance. Without knowing a particular organization’s industry, strategy, tool stack, or employees, you can only get so granular.

I wanted to write a book about data governance that everyone can use and implement regardless of those variables. Having said that, as it pertains to ethics, there is a key guideline that everyone should follow, and that is being cognizant of indirect bias versus direct bias.

Direct bias is discrimination against someone that is overt and specifically based on protected characteristics. Protected characteristics are things like your race, gender, et cetera. It is illegal to discriminate against someone based on those characteristics, so any data used in a project for explicit discrimination is something that of course everyone should avoid.

The more challenging type of bias is indirect bias. When we discuss indirect bias in the context of data, specifically AI, indirect bias is a byproduct of sensitive attributes that correlate with nonsensitive attributes.

One example of this is how prospective homeowners of color are more likely to be denied mortgage loans in certain areas of the country. And it’s not necessarily because of their skin color overtly. In very many cases, it’s because the datasets used to approve people for mortgage loans are based on historical data that was collected when now-illegal practices occurred.
One example is redlining in the city of Portland, Oregon. Redlining was common in Portland for decades, and it has been illegal for a long time now, but because so much historical data was collected when redlining was legal, people of particular skin colors are more likely to live in certain zip codes. You can thus start using tools that inadvertently discriminate against people by correlating sensitive attributes (skin color) with nonsensitive attributes (zip codes). And that’s when things can spiral.
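One simple guard a technical team can run before training is to check whether any supposedly nonsensitive feature is strongly associated with a protected attribute. A hypothetical pandas sketch with toy data:

```python
# Minimal sketch: flag nonsensitive features that correlate strongly with a
# protected attribute before training. Data and threshold are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["97201", "97201", "97203", "97203"],  # nonsensitive feature
    "race":     ["white", "white", "black", "black"],  # protected attribute
})

# One-hot encode both columns and measure pairwise correlation.
encoded = pd.get_dummies(df).astype(float)
corr = encoded.corr()

zip_cols = [c for c in encoded.columns if c.startswith("zip_code_")]
race_cols = [c for c in encoded.columns if c.startswith("race_")]

proxies = [
    (z, r) for z in zip_cols for r in race_cols
    if abs(corr.loc[z, r]) > 0.8  # hypothetical threshold
]
print(proxies)  # here zip code perfectly predicts race, so it is flagged
```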

From a legal standpoint, you of course should declare sensitive and prohibited attributes out of bounds for training models unless you can explicitly justify why they were used. Even in those cases, a lawyer would probably advise against it. But you also need to make sure that your technical teams are aware of how their models are being trained and where indirect bias can occur.

This has been a challenge in the past when teams use black box algorithms, because a black box algorithm means that you can’t see how these variables engage each other during training. That opens up the team and organization to risk when they can’t see how those decisions are being made. So it’s certainly something that you want to account for, regardless of your company.

There was a fairly famous case where Amazon built a recruiting tool that was trained on data showing bias against hiring women. Are there ways to create a more rigorous data framework and oversight that detects this bias earlier?

There are. I speak in Chapter 1 of the book about finding a data framework for your organization. I use Gartner’s seven-step data framework as an example. Ethics and transparency is one of those steps.

The purpose of a framework is to take a set of standards and apply them at the outset to your data-focused projects. That is a big reframe of how data is often used today. Just like data governance itself, data ethics is often conceived of as something that you do at the end of a project once an algorithm has been deployed to the public.

The truth is that not only does this have unethical consequences for users of that product, who did not consent to being discriminated against or to being evaluated by this biased algorithm, it also creates technical debt for teams.

I like to say that ethical debt is technical debt, because when you find bias to that degree, the only recourse is to scrap your model and retrain it from a point before the bias was introduced. That involves a lot of added work for your team.

“I like to say that ethical debt is technical debt.”


In response to anybody who tries to argue that data governance takes too much time and that establishing ethics standards is too laborious and will delay innovation: what ends up happening is that these tools are deployed, they are found to be biased, and not only do they cause serious damage to brand reputation, the algorithm must be taken off the market.

Now you’ve just wasted enormous time, money, and resources on an algorithm that had a detrimental impact on your organization’s brand, and, more importantly, on unsuspecting users. I would say that any framework you use for data governance has to not only include ethics but ensure that ethics and transparency are included at the outset.

I called this book Designing Data Governance because I believe that data governance is a design challenge. If models and structures are designed to have governance automated throughout, then the chances of success increase.

OpenAI’s ChatGPT is creating ripples in the news right now. A lot of that derives from the public’s new awareness of the models’ existence, but also how those models will be used for both positive and negative outcomes. Can you say anything about how this sudden interest affects data governance and the public?

I worry that the conversations about ChatGPT are creating more noise that detracts from fundamental challenges with data governance and maturity. I think things like ChatGPT reinforce to the average person that AI is this entity that has some degree of sentience when the reality is that AI is just data.

If you are not using data that upholds standards, is designed to be ethical, uses the right metadata, and embeds the right processes throughout your pipelines, then the conversation about tools like ChatGPT is never going to progress. People are not going to gain the degree of data literacy that is needed to successfully do data governance.

That creates an opportunity for organizations that do invest not only in data governance standards but in teaching those standards to their employees and sharing decisions about that data with them. Not many organizations are doing that right now.

If you are willing to invest the time, money, and resources into data literacy programs and establishing those standards, that gives you a huge leg up on your competition. I do worry, though, about the conversations that are happening around ChatGPT, because to me, all they do is reinforce current problems with perceptions about AI.

How can people follow you and stay in touch with what you’re doing?

They can follow me on LinkedIn and Twitter.

They can also subscribe to my LinkedIn newsletter, which is called Designing Data Governance from the Ground Up and is published every other Monday. It shares design thinking tips to build a data-driven culture. So folks who want insights on that topic can subscribe to the newsletter on LinkedIn.

I am on Twitter under my full name, Lauren Maffeo, although I go on Twitter less often these days.

I periodically write for the PragProg Medium blog on data governance topics as well, so if folks subscribe to the PragProg blog on Medium, you can find my writing there.

I would say also, the book will be available in print on February 21, 2023.

It is currently available through PragProg in beta, and it is available for pre-order through bookstores like Barnes & Noble.

If you’re interested in learning more about the topics discussed today, I would love for readers to buy the book.

Thank you for taking the time to chat. This was amazing and we learned a lot.


Drop any questions or feedback you have for Lauren Maffeo (@lmaffeo) into this thread. Everyone commenting or asking a question will automatically be entered into a drawing to win a copy of her Designing Data Governance from the Ground Up ebook!

If you don’t want to wait, you can pick up your copy of Designing Data Governance from the Ground Up: Six Steps to Build a Data-Driven Culture today!

Don’t forget you get 35 percent off with the coupon code devtalk.com!



Another great spotlight and another book I need to read :003:

I’m curious what you think about some of the recent legal action in the AI space, Lauren (We’ve filed a lawsuit against GitHub Copilot and We’ve filed a lawsuit challenging Stable Diffusion). Do you think companies should get explicit permission before using the work of others in their models? (Apologies if you’ve already covered this, I haven’t finished reading the spotlight but wanted to post this while it’s fresh in my mind :lol:)


Hi @lmaffeo!
Welcome to the forum! :slight_smile:

Congratulations on your book, Designing Data Governance from the Ground Up! This is a very good topic.

You mentioned that there is an abundance of data without any quality control, and that low-quality data is often consumed. I think this is why GitHub Copilot sometimes generates code that requires more time to fix than to write from scratch. Copilot uses a vast amount of data available on GitHub, but not all of it is of good quality.


Hello everyone!

I’m your friendly Devtalk bot :nerd_face:

Thank you to all of you who participated in our Spotlight AMA!

This is now closed and all of those who commented above have been entered into the draw - meaning we’re now ready to pick a winner!


Entering the following members into the draw…


And the winner is…

Drum roll…

Congratulations @DevotionGeo you are the chosen one!! We’ll be in touch about your prize via PM soon :smiley:

Thank you everyone who entered, and of course @lmaffeo for taking part in our Spotlight - thank you! :blue_heart:


Congratulations @DevotionGeo! Check your mail for your copy. :smiley:
