Developing with Data: How to Save Time & Get Amazing Results
by Admin on Aug 25, 2015
You should know what you’re dealing with before developing with data.
How you approach your data queries and the inputs you use can have a significant impact on your data output and match rates.
From simple tricks to thoughtful approaches, there are lots of ways to optimize your data usage for better results, lower cost and a lot of saved time.
In this post, we’ll cover all the tips and tricks that we regularly share with developers and data professionals when we help them integrate a data API into their software/database.
*The tips below apply to any type of data project – but the examples are people data-specific since it’s our expertise.
Setting proper expectations
Let's start off with what you should expect. Some types of data are easy to come by, while others require searching, matching and even guessing.
How sure are you that the data you need is available and how easy is it to find that data?
Let’s say you want to find Facebook usernames by querying Twitter ids. Would you expect to find a Facebook profile for each Twitter user id? Probably not – some of your user ids may be for corporate accounts that don’t have Facebook profiles, some may be spam accounts and others may be genuine Twitter users that simply don’t use Facebook.
You can judge your results better when you have a good idea of what to expect.
People data
There are two types of people data – online (email addresses, social profiles, etc.) and offline (background reports, mailing address, etc.).
Both require you to query with a piece of known information, like an email address or Social Security number; the data service then searches for information it can match to the same person. In a lot of these cases, getting a 20-30% match rate is considered very good (depending on what you're searching with and for).
Keep in mind that online data is often incomplete, vague and inconsistent, so match rates can be somewhat lower.
Getting the right dataset
The quality of your dataset will determine how successful your queries will be. This should be obvious, but many people don’t realize how much of an impact your dataset has.
For instance, searching for information about people using Twitter usernames results in a lower match rate than using LinkedIn usernames.
Why?
- LinkedIn usernames are almost always the person’s first and last name, while Twitter usernames are free text.
- Twitter doesn't require you to add any personal details, while LinkedIn profiles are full of them.
- Many Twitter usernames belong to corporate or junk accounts which won’t return any data, while LinkedIn usernames almost always belong to a real person.
Unique vs. Generic
Unique data is data that can lead to only one outcome; for example, an email address can belong to only one person.
The chances of receiving a definite result are much higher when you query unique data.
If your query is generic, like “Nancy Green,” it’ll be hard to find information since there’s no way to determine which of the many Nancy Greens is the one you’re searching for.
You can make your query less generic by adding more information like an address or city, age, username, etc. Finding information about “Nancy Green of Hoboken, NJ” is much easier than just “Nancy Green.”
For people data, the following are usually unique data inputs:
- email address
- username for a specific social network (e.g. username@linkedin)
- mobile phone number
- social security number, bank account number, etc.
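To make the unique vs. generic point concrete, here's a toy sketch in Python (the records are made up) showing how each added data point shrinks the pool of possible matches:

```python
# Toy candidate records - illustrative only.
candidates = [
    {"name": "Nancy Green", "city": "Hoboken", "age": 34},
    {"name": "Nancy Green", "city": "Austin", "age": 52},
    {"name": "Nancy Green", "city": "Hoboken", "age": 67},
]

def match(records, **criteria):
    """Keep only records that agree with every supplied criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

print(len(match(candidates, name="Nancy Green")))                          # 3 - generic
print(len(match(candidates, name="Nancy Green", city="Hoboken")))          # 2 - better
print(len(match(candidates, name="Nancy Green", city="Hoboken", age=34)))  # 1 - unique
```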
Connected vs. Isolated
Next, look at how connected your queried data is. Connected data is more likely to lead to fuller results.
A good example is Pinterest usernames vs. Twitter usernames. Pinterest is connected, since a Pinterest profile often includes your Facebook and Twitter profiles, real name and location. Twitter usernames are more isolated, since Twitter profiles often don't include links to other social profiles, a real name or a bio. That doesn't mean Twitter profiles don't return data – just that they're less likely to return as much data as Pinterest profiles.
Connected data inputs include:
- email addresses
- Facebook, LinkedIn, Pinterest, AngelList profiles
- name & address
- phone number
You’ll also need to think about the data you want to receive. If you’re looking for job titles, LinkedIn usernames or work emails are useful. Looking for social profiles? Name, address and email work well.
People data sources that are rich in data:
- phone directories
- public records
Data cleanliness
The last thing to keep in mind is how clean and ordered your dataset is. For example, international phone numbers should include country codes, and names should be full names.
If your data is vague, contains duplicates, inaccuracies or isn’t normalized, you’ll probably end up with fewer results than expected.
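What that cleanup looks like depends on your data, but as a rough first-pass sketch in Python with pandas (the column names here are assumptions):

```python
import pandas as pd

df = pd.read_csv("people.csv")  # assumed columns: name, email, phone

# Trim stray whitespace and normalize casing so "nancy green " matches "Nancy Green".
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()

# Drop exact duplicates and rows missing the field you plan to query with.
df = df.drop_duplicates().dropna(subset=["email"])
```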
Choosing the right sample size to test
It’s important to run a few tests before choosing a data provider or even deciding the best query structure.
Make sure to use a large enough sample to evaluate the data provider when running your tests. I've seen many customers use only 1-2 queries as a test. This is a big mistake because (a) a sample of two isn't statistically significant and (b) the two people in the sample usually don't represent the people in the actual dataset.
To determine your optimal sample size, you need:
- The size of your dataset. This is the total number of unique queries you plan on making. If you’re using the People Data API, that would be the total number of people whose data you want to enhance.
- Margin of error. How precise the results need to be. For example, if your margin of error is 5% and your sample match rate is 20%, then you can expect 15 – 25% match rate for your entire dataset (20% ± 5%).
- Confidence level. How reliable the results need to be (usually 90%, 95% or 99%). In other words, if you repeated the test, how likely is it that you'd get similar results again?
There are plenty of sample size calculators online.
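If you'd rather compute it yourself, here's a minimal sketch using the standard sample-size formula for estimating a proportion (your match rate), with a finite-population correction for smaller datasets:

```python
import math

def sample_size(population, margin_of_error=0.05, confidence=0.95, p=0.5):
    """How many test queries you need to estimate your match rate."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]  # z-scores
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2      # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))      # finite-population correction

print(sample_size(10_000))  # ~370 test queries for +/-5% at 95% confidence
```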
How to choose a random sample
A good sample will represent the dataset. There are lots of sampling methods out there but using a random sample is usually the way to go.
Choosing a random sample is easy and can be done in Excel:
- Add your dataset to your spreadsheet.
- Create a new column called “Random Number” (or whatever you want).
- Enter "=RAND()" into the first cell below the header of the new column.
- Copy and paste the formula into the cells beneath it.
- Select all of the randomly generated numbers, copy them and paste them back into the same cells as values (this freezes the numbers so they don't recalculate when you sort).
- Sort the table from smallest to largest based on the number in the “Random Number” column.
- Choose your sample starting from the top.
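If your dataset lives in code rather than a spreadsheet, the same random sample is a one-liner with pandas:

```python
import pandas as pd

df = pd.read_csv("people.csv")
sample = df.sample(n=370, random_state=42)  # fixed seed makes the sample reproducible
sample.to_csv("sample.csv", index=False)
```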
Tips for optimizing your queries
The dataset is good, your sample size is perfect and now you’re ready to query data. Guess what? There’s still plenty of room for improvement.
Below are the data query optimization techniques we’ve found work best.
The more data, the better
More data means less guessing and more data points to explore. As you add more data, your query becomes much more specific (ex: "Nancy Green from Hoboken" vs. just "Nancy Green"), which makes it easier to find a definite result. You can also make it easier to find more data by combining different types of data points.
For instance, combining “Nancy Green from Hoboken” with “nancy.green@acme.com” gives the data provider both offline and online inputs with which to search. A name and address can easily lead to a phone number and relatives while an email address can lead to social profiles, job title and employer.
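In API terms, that just means sending both kinds of inputs in one query. Here's a hedged sketch - the endpoint and parameter names are made up, so check your provider's docs for the real ones:

```python
import requests

API_URL = "https://api.example.com/v1/person"  # hypothetical endpoint

params = {
    "first_name": "Nancy",            # offline inputs: name + location
    "last_name": "Green",
    "city": "Hoboken",
    "state": "NJ",
    "email": "nancy.green@acme.com",  # online input: email address
    "key": "YOUR_API_KEY",
}
response = requests.get(API_URL, params=params, timeout=10)
person = response.json()
```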
Leveraging parameters
Wherever possible, use inputs that can be connected to definite matches. These include email addresses, usernames for specific social networks, mobile phone numbers, etc.
Build up your queries or break down your results
See if you can add parameters to make something more specific (ex: we allow you to target a social network with an "@service" parameter vs. just searching a username). Alternatively, try breaking down the results – like using keywords in a Twitter bio to infer data about a person.
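Breaking down results can be as simple as scanning free text for known signals. A rough sketch (the keyword list is just a placeholder - build your own from your use case):

```python
import re

JOB_KEYWORDS = {"engineer", "designer", "founder", "marketing", "developer"}

def infer_roles(bio):
    """Pull likely job-related keywords out of a free-text bio."""
    words = set(re.findall(r"[a-z]+", bio.lower()))
    return words & JOB_KEYWORDS

print(infer_roles("Coffee addict. Front-end developer & designer at Acme."))
# {'developer', 'designer'}
```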
Confidence levels/inference
How accurate does the data need to be? If you’re using data for marketing or need a high-level overview, the data often doesn’t need to be 100% accurate.
In cases where you can afford a small margin of error, there’s plenty of additional information you can infer.
When it comes to people data, the types of data that can be inferred are:
- gender
- ethnicity
- location
- work email address
- language(s)
Use search pointers when available
A common feature of data APIs is the search pointer. A search pointer is a special string that can be used for follow-up searches to your initial query. This is very useful since you can often get a lot more data if you dig a bit deeper.
In our API, a search pointer is given when we can’t find a definite match to your query. Each search pointer suggests a possible match and includes a match score to help you choose your follow-up search.
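A follow-up loop might look something like this. The response field names ("possible_matches", "match_score", "search_pointer") follow the description above but are assumptions - check the API reference for the real response shape:

```python
# Example response shape for a query with no definite match (assumed fields):
first_response = {
    "possible_matches": [
        {"match_score": 0.91, "search_pointer": "ptr_abc123"},
        {"match_score": 0.55, "search_pointer": "ptr_def456"},
    ]
}

def best_follow_up(response_json, threshold=0.8):
    """Pick the highest-scoring possible match worth a follow-up search."""
    good = [m for m in response_json.get("possible_matches", [])
            if m.get("match_score", 0) >= threshold]
    return max(good, key=lambda m: m["match_score"])["search_pointer"] if good else None

pointer = best_follow_up(first_response)
if pointer:
    print("Re-query with search_pointer =", pointer)  # send as a query parameter
```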
Operators
Use and/or operators when your provider supports them for more flexibility in structuring your queries.
Think big picture – cultural, national, geographic
We had a user who once quadrupled his match rate by adding country codes to the list of phone numbers he was querying.
It's important to think about the data you're querying and to make sure you've included any variations that might arise due to different languages, spellings, local standards, etc.
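For phone numbers specifically, the open-source phonenumbers library (a Python port of Google's libphonenumber) handles exactly this kind of normalization:

```python
import phonenumbers  # pip install phonenumbers

# If a number has no country code, you must tell the parser its region.
for number, region in [("(201) 555-0123", "US"), ("020 7946 0958", "GB")]:
    parsed = phonenumbers.parse(number, region)
    print(phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164))
# +12015550123
# +442079460958
```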
Ideas to save time & money
Even if your dataset is perfect and your queries are top-notch, you’re not done yet! You can still save yourself a lot of time and money.
Develop first with less data
Be very sure about the information you need for your project. You don't always need to use as much data as possible. You can save a lot of money by having a very clear idea of the minimum amount and types of data you need, testing, and then scaling up if necessary.
Supplement your data if possible
For a lot of startups, the cost of data is very important.
We’ve been approached in the past by small startups that can’t afford premium data, but need it for their project. The solution that’s worked best for them is to use a cheaper data provider for the simple lookups while using us for the rest.
This doesn’t produce the best results, but we’ve been told it works better than not accessing premium data at all.
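The routing logic itself is trivial - the two lookup functions below are placeholders for whatever client calls your providers offer:

```python
def cheap_lookup(query):
    ...  # placeholder: call the budget provider's API here

def premium_lookup(query):
    ...  # placeholder: call the premium provider's API here

def lookup(query):
    """Route every query to the cheaper provider first; escalate only on a miss."""
    result = cheap_lookup(query)
    return result if result is not None else premium_lookup(query)
```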
What are the strengths and weaknesses of each provider?
Every data provider has its strengths and weaknesses. Data is not a zero-sum game even when two data providers offer the same solution.
It's always worth testing providers on several types of data to get a good idea of when to use each provider.
How to QA your results
Making sure your data is error-free is an important step.
For simplicity's sake, I'll just list some tips for QAing your own results (a small sketch of automated checks follows the list).
- Always check multiple results; you need to see if there is a pattern or a one-time error.
- Look for suspicious results, such as results with too much or too little data.
- Analyze your results – don’t just look at statistics. For instance, if you find a person with two birth dates, are they both within a reasonable time frame?
- Check the sources of incorrect results. You need to know why you received incorrect results, and checking their sources is the best way to find out.
- Be methodical with QA and have a structured process.
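Here's what those automated sanity checks might look like - the records and field names below are made up, so adapt the checks to your own schema:

```python
results = [  # toy results - illustrative only
    {"names": ["Nancy Green"], "dates_of_birth": [1981, 1953], "emails": ["n@x.com"]},
    {"names": ["John Doe"], "dates_of_birth": [1990], "emails": ["j@x.com"]},
]

def suspicious(record):
    """Return the reasons a result deserves a manual look."""
    reasons = []
    years = record.get("dates_of_birth", [])
    if len(years) > 1 and max(years) - min(years) > 2:
        reasons.append("conflicting birth years")
    if not record.get("names"):
        reasons.append("no name returned")
    if len(record.get("emails", [])) > 10:
        reasons.append("implausibly many emails")
    return reasons

for record in results:
    if suspicious(record):
        print(record["names"], suspicious(record))
```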
Analyzing your results
[Image: Types of data analysis, University of Jyväskylä Koppa]
The last step is analyzing your data. Since there's too much to cover in one blog post, here are a few tips:
- Think about what you want to get out of the data, what are the perfect results and how would you find them?
- Look at both the quantitative and qualitative results
- Normalize and complete your data
- Establish your goals/KPIs
- Choose your tools carefully
- If you’re automating the analysis, keep in mind that automating complex data analysis is very difficult
And a few resources to help your analysis:
- Get some useful Excel tutorials for data analysis via Excel Easy.
- A great list of data analysis tools for businesses.
- A useful cheat sheet full of SQL commands.
Integrating data into your application
Data is a lot more complicated than most people think, which is why it's so important to understand your data before you start developing with it. Proper research and planning can make a huge difference in how much data you're able to get, how accurate it is, how long it takes to get it and how much it will cost.