You should know what you’re dealing with before developing with data.
How you approach your data queries and the inputs you use can have a significant impact on your data output and match rates.
From simple tricks to thoughtful approaches, there are lots of ways to optimize your data usage for better results, lower cost and a lot of saved time.
In this post, we’ll cover all the tips and tricks that we regularly share with developers and data professionals when we help them integrate a data API into their software/database.
*The tips below apply to any type of data project – but the examples are specific to people data, since that’s our expertise.
Skip to a section: datasets | sampling | queries & providers | working with results
Let’s start off with what you should expect. Some types of data are easy to come by, while others require searching, matching and even guessing.
How sure are you that the data you need is available and how easy is it to find that data?
Let’s say you want to find Facebook usernames by querying Twitter ids. Would you expect to find a Facebook profile for each Twitter user id? Probably not – some of your user ids may be for corporate accounts that don’t have Facebook profiles, some may be spam accounts, and others may be genuine Twitter users who simply don’t use Facebook.
You can judge your results better when you have a good idea of what to expect.
There are two types of people data – online (email addresses, social profiles, etc.) and offline (background reports, mailing address, etc.).
Both require you to query a piece of known information, like an email address or social security number, and the data service will search for information it can match to the same person. In a lot of these cases, getting a 20-30% match rate is considered very good (depending on what you’re searching with and for).
Keep in mind that online data is often incomplete, vague and inconsistent, so match rates can be somewhat lower.
The quality of your dataset will determine how successful your queries will be. This should be obvious, but many people don’t realize just how big an impact the dataset has.
For instance, searching for information about people using Twitter usernames results in a lower match rate than using LinkedIn usernames.
Why?
Unique data is data that can lead to only one outcome – for example, an email address can belong to only one person.
The chances of receiving a definite result are much higher when you query unique data.
If your query is generic, like “Nancy Green,” it’ll be hard to find information since there’s no way to determine which of the many Nancy Greens is the one you’re searching for.
You can make your query less generic by adding more information like an address or city, age, username, etc. Finding information about “Nancy Green of Hoboken, NJ” is much easier than just “Nancy Green.”
For people data, the following are usually unique data inputs:
- email addresses
- usernames and profile URLs for specific social networks
- mobile phone numbers
- government IDs, like social security numbers
Next, look at how connected your queried data is. Connected data is more likely to lead to fuller results.
A good example is Pinterest usernames vs. Twitter usernames. Pinterest is connected, since a Pinterest profile often includes your Facebook and Twitter profiles, real name and location. Twitter usernames are more isolated, since profiles often don’t include links to other social profiles, a real name or a bio. That doesn’t mean that Twitter profiles don’t return data – just that they’re less likely to return as much data as Pinterest profiles.
Connected data inputs include:
- LinkedIn usernames
- Pinterest usernames
- Facebook profiles
- personal email addresses
You’ll also need to think about the data you want to receive. If you’re looking for job titles, LinkedIn usernames or work emails are useful. Looking for social profiles? Name, address and email work well.
People data sources that are rich in data:
- LinkedIn profiles (job title, employer, education)
- Facebook profiles (real name, location, friends)
- Pinterest profiles (linked Facebook and Twitter accounts, real name, location)
The last thing to keep in mind is how clean and ordered your dataset is – e.g. international phone numbers should include country codes, and names should be full names.
If your data is vague, contains duplicates, inaccuracies or isn’t normalized, you’ll probably end up with fewer results than expected.
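To give a sense of what this cleanup looks like in practice, here’s a minimal sketch that normalizes phone numbers to E.164 format (with country codes) using the open-source phonenumbers library. The default region is an assumption you’d adjust for your own dataset:

```python
import phonenumbers

def normalize_phone(raw, default_region="US"):
    """Return a phone number in E.164 format (+12015550123), or None if it
    can't be parsed as a valid number for the assumed default region."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

print(normalize_phone("(201) 555-0123"))       # +12015550123
print(normalize_phone("020 7946 0958", "GB"))  # +442079460958
```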
It’s important to run a few tests before choosing a data provider or even deciding the best query structure.
Make sure to use a large enough sample to evaluate the data provider when running your tests. I’ve seen many customers use only 1-2 queries as a test. This is a big mistake because a) a sample of two isn’t statistically significant, and b) the two people in the sample usually don’t represent the people in the actual dataset.
To determine your optimal sample size, you need:
- the size of your dataset (the population)
- your desired confidence level (95% is typical)
- your acceptable margin of error (5% is typical)
There are plenty of sample size calculators online.
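If you’d rather compute it yourself, the standard approach (Cochran’s formula with a finite-population correction) is only a few lines. The defaults below – 95% confidence and a 5% margin of error – are common choices, not magic numbers:

```python
import math

def sample_size(population, confidence=0.95, margin_of_error=0.05, proportion=0.5):
    """Cochran's sample-size formula with a finite-population correction.
    proportion=0.5 is the most conservative choice (maximum variability)."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]  # common z-scores
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite-population correction

print(sample_size(10_000))  # 370 test queries for a 10,000-row dataset
```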
A good sample will represent the dataset. There are lots of sampling methods out there but using a random sample is usually the way to go.
Choosing a random sample is easy and can be done in Excel: add a helper column filled with =RAND(), sort the sheet by that column, and take the first N rows as your sample.
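If your dataset lives in a CSV rather than a spreadsheet, the same random sample is a one-liner in pandas. The file name and sample size here are placeholders:

```python
import pandas as pd

df = pd.read_csv("contacts.csv")             # your full dataset
sample = df.sample(n=370, random_state=42)   # fixed seed makes the test repeatable
sample.to_csv("sample.csv", index=False)
```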
The dataset is good, your sample size is perfect and now you’re ready to query data. Guess what? There’s still plenty of room for improvement.
Below are the data query optimization techniques we’ve found work best.
More data means less guessing and more data points to explore. As you add more data, your query becomes much more specific (ex: “Nancy Green from Hoboken” vs. just “Nancy Green”), which makes it easier to find a definite result. You can also make it easier to find more data by combining different types of data points.
For instance, combining “Nancy Green from Hoboken” with “nancy.green@acme.com” gives the data provider both offline and online inputs with which to search. A name and address can easily lead to a phone number and relatives while an email address can lead to social profiles, job title and employer.
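As a concrete sketch, a combined query against a generic people-data API might look like the snippet below. The endpoint, field names and API key are placeholders, not any particular provider’s real interface:

```python
import requests

API_URL = "https://api.example-data-provider.com/v1/search"  # placeholder endpoint

# Offline inputs (name + location) combined with an online input (email)
# give the provider several independent ways to pin down the same person.
query = {
    "first_name": "Nancy",
    "last_name": "Green",
    "city": "Hoboken",
    "state": "NJ",
    "email": "nancy.green@acme.com",
}

response = requests.get(API_URL, params={**query, "key": "YOUR_API_KEY"})
response.raise_for_status()
print(response.json())
```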
Try to make sure you use inputs that can be connected to definite matches. This includes email addresses, usernames for specific social networks, mobile phone numbers, etc.
See if you can add parameters to make your query more specific (ex: we allow you to target a social network with an “@service” parameter vs. just searching a username). Alternatively, try breaking down the results – like using keywords in a Twitter bio to infer data about a person.
How accurate does the data need to be? If you’re using data for marketing or need a high-level overview, the data often doesn’t need to be 100% accurate.
In cases where you can afford a small margin of error, there’s plenty of additional information you can infer.
When it comes to people data, the types of data that can often be inferred include:
- gender and approximate age, from a first name
- location, from a phone number’s country and area code
- interests and profession, from bio keywords and social activity
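Location is a good example: with the same phonenumbers library used above, you can infer a rough location from nothing but a phone number’s country and area code:

```python
import phonenumbers
from phonenumbers import geocoder

# +1 201 is a New Jersey area code, so even without a full profile
# we can infer an approximate location from the number alone.
number = phonenumbers.parse("+12015550123")
print(geocoder.description_for_number(number, "en"))  # "New Jersey"
```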
A common feature of data APIs is the search pointer. A search pointer is a special string that can be used for follow-up searches to your initial query. This is very useful since you can often get a lot more data if you dig a bit deeper.
In our API, a search pointer is given when we can’t find a definite match to your query. Each search pointer suggests a possible match and includes a match score to help you choose your follow-up search.
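In code, a follow-up search might look roughly like this. The response fields (possible_matches, match_score, search_pointer) mirror the pattern described above, but treat them as illustrative rather than any provider’s exact schema:

```python
import requests

API_URL = "https://api.example-data-provider.com/v1/search"  # placeholder endpoint

def search(params):
    resp = requests.get(API_URL, params={**params, "key": "YOUR_API_KEY"})
    resp.raise_for_status()
    return resp.json()

result = search({"first_name": "Nancy", "last_name": "Green"})

if "person" not in result and result.get("possible_matches"):
    # No definite match: take the candidate with the highest match score
    # and run a follow-up search using its pointer.
    best = max(result["possible_matches"], key=lambda m: m["match_score"])
    result = search({"search_pointer": best["search_pointer"]})

print(result)
```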
Use and/or logic in your queries when possible for more flexibility.
We had a user who once quadrupled his match rate by adding country codes to the list of phone numbers he was querying.
It’s important to think about the data you’re querying and to make sure you’ve included any variations that might arise due to different languages, spellings, local standards, etc.
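A cheap way to cover those variations is to generate them yourself before querying. Here’s a minimal sketch that strips diacritics and expands a few common nicknames – the nickname table is a toy example, not a real dataset:

```python
import unicodedata

NICKNAMES = {"bill": "william", "bob": "robert", "kate": "katherine"}  # toy table

def strip_accents(text):
    """Decompose accented characters and drop combining marks: 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def name_variants(name):
    """Return the name plus accent-stripped and nickname-expanded variants."""
    variants = {name, strip_accents(name)}
    first = name.split()[0]
    if first.lower() in NICKNAMES:
        variants.add(name.replace(first, NICKNAMES[first.lower()].title(), 1))
    return variants

print(name_variants("José Green"))  # {'José Green', 'Jose Green'}
print(name_variants("Bill Green"))  # {'Bill Green', 'William Green'}
```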
Even if your dataset is perfect and your queries are top-notch, you’re not done yet! You can still save yourself a lot of time and money.
Be very sure about the information you need for your project. You don’t always need to use as much data as possible. You can save a lot of money by having a very clear idea of the minimum amount and types of data you need, testing, and then scaling up if necessary.
For a lot of startups, the cost of data is very important.
We’ve been approached in the past by small startups that can’t afford premium data, but need it for their project. The solution that’s worked best for them is to use a cheaper data provider for the simple lookups while using us for the rest.
This doesn’t produce the best results, but we’ve been told it works better than not accessing premium data at all.
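The waterfall itself is only a few lines: query the cheap provider first and escalate to the premium one only on a miss. Both provider functions below are stand-ins for whatever API clients you actually use:

```python
def cheap_provider_search(query):
    """Stand-in for a low-cost provider's client; returns None on a miss."""
    return None  # call the cheap provider's API here

def premium_provider_search(query):
    """Stand-in for the premium provider's client."""
    return {"source": "premium", "query": query}  # call the premium API here

def lookup(query):
    # Waterfall: cheap lookup first, premium only when the cheap one misses.
    return cheap_provider_search(query) or premium_provider_search(query)

print(lookup({"email": "nancy.green@acme.com"}))
```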
Every data provider has its strengths and weaknesses. Data is not a zero-sum game even when two data providers offer the same solution.
It’s always worth testing providers on several types of data to get a good idea of when to use each provider.
Making sure your data is error-free is an important step.
For simplicity’s sake, here are a few quick checks you can run to QA your own results, sketched in code below.
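This is a rough sketch of the kind of automated checks that catch most problems – duplicates, missing values and malformed emails. The file and column names are assumptions about your own results:

```python
import pandas as pd

df = pd.read_csv("results.csv")  # assumed columns: name, email, phone

# 1. Duplicates: the same person returned more than once
print("duplicate rows:", df.duplicated().sum())

# 2. Missing values per column
print(df.isna().sum())

# 3. Malformed emails (a simple sanity check, not full RFC validation)
email_ok = df["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print("bad emails:", (~email_ok).sum())
```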
The last step is analyzing your data. Since there’s too much to cover in one blog post, here’s a resource to help your analysis:
- Types of data analysis, University of Jyväskylä Koppa
Data is a lot more complicated than most people think, which is why it’s so important to understand your data before you start developing with it. Proper research and planning can make a huge difference in how much data you’re able to get, how accurate it is, how long it takes to get it and how much it will cost.