Premise Data Corporation: Food Staples Indexes
Methodology, Technical Details, and Documentation
November 2014
1.0 Overview of the Premise Food Staples Index
Premise develops Food Staples Price Indexes (FSPI) and other economic and
business indicators in over 20 countries around the world. Premise is providing
free access to six indexes (Argentina, Brazil, China, India, Liberia, and the
United States) through data.premise.com, our online reporting portal. Each
FSPI tracks data on daily changes in the prices paid by urban consumers across
a representative basket of core food staples.
The Premise FSPI methodology was developed by an in-house data science team,
with consultative guidance from Alan Krueger (former Chairman of the White
House Council of Economic Advisers), Hal Varian (Chief Economist at Google),
and Matthew Granade (former Head of Research at Bridgewater Associates).
1.1 Harmonized Index Construction
The Premise FSPI sampling and index construction methodology is harmonized
across 20 actively monitored countries and covers fourteen component indexes:
1. Beverages
2. Dairy & Eggs
3. Fish & Seafood
4. Fruit
5. Grains & Nuts
6. Herbs, Spices, and Condiments
7. Meat
8. Oils & Fats
9. Other Snacks
10. Processed Fruits and Vegetables
11. Processed Grains
12. Processed Meats
13. Sweets
14. Vegetables
For each country, the data.premise.com portal (DPC) provides access to these
fourteen component indexes as well as the weighted-average overall FSPI for
that country.
These fourteen categories were selected and weighted by harmonizing the official
governmental price index releases in the twenty FSPI countries. When possible,
Premise aims to collect a broader basket of food prices than is traditionally
tracked.
1.2 Component-Level Index Construction
Each component index is a weighted combination of basic indices capturing the
price movements of specific classes of products and items.
For example, in India, the vegetables component index is a weighted
combination of the following six basic indices:
1. Fungi
2. Gourds
3. Leafy Green Vegetables
4. Root Vegetables
5. Savory Fruits
6. Tubers
1.3 Item-Level Sampling
The most granular level of categorization in the FSPI is a specific product variant,
e.g. “Yellow Banana [1 medium bunch]” or “Starkist Canned Tuna [12oz]”. These
items make up the elementary product indexes in each FSPI. The typical FSPI
contains over 1000 of these elementary series, covering a wide variety of brands
and packaging variants.
Elementary series are chosen by:
1. Referencing official releases made by government statistical agencies and
monetary authorities (a full list of sources can be found in Section 4.0).
2. Consulting regional experts in each of the five countries to determine if
there were any goods and services that were not explicitly tracked by the
country’s statistical agencies but were still important staple products in
those countries.
By involving regional experts, Premise can tailor categories from country to
country that account for the importance of certain regional items.
For example, in Brazil, the basic index of shark products within the fish & seafood
index includes chilled and frozen dogfish - a product that is not particularly
popular in neighboring Argentina and thus is not a part of Argentina’s FSPI.
Each component is divided into subcomponents that provide further classification
for the set of products. In Argentina, for example, the meat component is
divided into four subcomponents - beef, chicken, lamb, and pork. A sample of
the elementary product indexes contained in the beef subcomponent is shown
below in its taxonomic classification:
1. Beef
a. Prime Cut
i. Filet Mignon
ii. Tenderloin
iii. Topside
iv. Round-eye
v. New York Strip
vi. Tri-tip
b. Standard Cut
i. Brisket
ii. Rump
iii. Shank
iv. Flank Steak
v. Short Ribs
vi. Shredded Beef
c. Other
i. Ground Beef
ii. Beef Milanesa
iii. Beef Navel
iv. Beef Rib Eye
v. Beef Round
vi. Beef Tongue
vii. Cow Stomach
viii. Skirt Steak
ix. Beef Liver
1.4 Weighting
Elementary indexes are weighted based on their relative importance as shown by
the country’s consumer expenditure survey. See Section 4.0 for a list of sources.
1.5 Example
A diagram outlining the taxonomic classification process described above is
shown on the following page:
1.6 Time Series Construction
To convert raw observational data into indexes, we use a methodology that
mirrors that used by the United States Bureau of Labor Statistics (BLS) in
constructing the CPI. Indexes are constructed for each basic product type and
then combined using weights from the product hierarchy to derive indexes for
the subcomponent, component, and headline levels.
The key difference is that because our data is sourced from a distributed network
of contributors, we take particular care in adjusting the sample of observations
used to construct indexes, such that the composition of the sample is close to
constant over time. Unlike the BLS, our sample of observations is not necessarily
the same from period to period. Our contributors are remote and at arm's length.
Hence we cannot guarantee that our raw sample of prices represents the same
set of products at the same places from period to period. Because our raw
sample varies over time, we use a methodology that minimizes the effect of
these variations in the sample. Neutralizing the effects of sample variations is
important as it allows us to interpret the index in the typical way, as the change
in prices for a fixed basket of goods.
One straightforward way of creating a daily index is to compute the average
price each day. However, this approach exposes daily index values to changes in
which stores, cities, and package types happen to be sampled that day, rather
than to changes in prices alone.
We control for changes in sample composition by constructing indexes for groups
of observations at the level of product-place-package. Rather than computing
the average price for apples from all apple observations, we compute the average
price of apples, from the average price of apples at supermarket S in city C, for
all S and C. Hence changes in sample composition from period to period are
neutralized. This decomposition requires a much higher volume of data, which
poses problems for traditional data collection yet is possible for Premise because
of the volume of data that we collect.
Collating more than a hundred product-level indexes into a single country-level
headline food staples index requires a product hierarchy with associated
weights. The construction of Premise’s product hierarchies is described earlier
in Sections 1.1-1.4.
The headline index therefore represents movements in the weighted average
expenditure per household, with weights based on each household’s expenditure.
Thus, to calculate the FSPI for each country, the following steps are taken:
1. Compute product-level indexes based on average prices for all products:
a. Find combinations of package types, stores, cities (“identity keys”)
characterizing the observations for the product.
b. Calculate the average price each day for each of these identity keys.
c. Choose a common base period and normalize these identity key level
indexes by the value at the base period.
d. Aggregate up to the product level by calculating a weighted average.
2. Given product-level indexes:
a. Group products by their nearest parent node.
b. Within each group, construct a group-level index by calculating a
weighted average, where weights are taken from the product hierarchy.
3. Apply step 2 for each level of indexes in the product hierarchy until the
initial node is reached, which is the headline index.
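The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Premise's production code: the record layout (`day`, `package`, `store`, `city`, `price`), the equal weighting of identity keys in step 1d, the handling of missing days, and the toy hierarchy weights are all assumptions made for the example.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def weighted_average(series_by_name, weights):
    """Combine {name: {day: index}} into one {day: index} series."""
    days = sorted({day for series in series_by_name.values() for day in series})
    combined = {}
    for day in days:
        # Assumption: use only the series that have a value on this day.
        present = [name for name, series in series_by_name.items() if day in series]
        total_weight = sum(weights[name] for name in present)
        combined[day] = sum(
            weights[name] * series_by_name[name][day] for name in present
        ) / total_weight
    return combined


def product_index(observations, base_day):
    """Steps 1a-1d for a single product; the base period is set to 100."""
    # 1a/1b: group prices by identity key (package, store, city) and day.
    prices = defaultdict(lambda: defaultdict(list))
    for obs in observations:
        key = (obs["package"], obs["store"], obs["city"])
        prices[key][obs["day"]].append(obs["price"])
    # 1c: normalize each identity-key series by its base-period average.
    key_indexes = {}
    for key, by_day in prices.items():
        if base_day not in by_day:
            continue  # assumption: drop keys with no base-period observation
        base = mean(by_day[base_day])
        key_indexes[key] = {d: 100.0 * mean(p) / base for d, p in by_day.items()}
    # 1d: aggregate identity keys (assumption: equal weights).
    equal_weights = {key: 1.0 for key in key_indexes}
    return weighted_average(key_indexes, equal_weights)


def roll_up(node, product_indexes):
    """Steps 2-3: recursively combine children into their parent node."""
    children = node.get("children")
    if not children:
        return product_indexes[node["name"]]
    child_series = {c["name"]: roll_up(c, product_indexes) for c in children}
    child_weights = {c["name"]: c["weight"] for c in children}
    return weighted_average(child_series, child_weights)


apples = [
    {"day": 0, "package": "1kg", "store": "S1", "city": "C", "price": 1.0},
    {"day": 0, "package": "1kg", "store": "S1", "city": "C", "price": 1.0},
    {"day": 0, "package": "1kg", "store": "S2", "city": "C", "price": 2.0},
    {"day": 1, "package": "1kg", "store": "S1", "city": "C", "price": 1.1},
    {"day": 1, "package": "1kg", "store": "S2", "city": "C", "price": 2.0},
    {"day": 1, "package": "1kg", "store": "S2", "city": "C", "price": 2.0},
]
bananas = [
    {"day": 0, "package": "bunch", "store": "S1", "city": "C", "price": 0.5},
    {"day": 1, "package": "bunch", "store": "S1", "city": "C", "price": 0.5},
]
hierarchy = {
    "name": "headline",
    "children": [
        {"name": "fruit", "weight": 1.0, "children": [
            {"name": "apples", "weight": 0.6},
            {"name": "bananas", "weight": 0.4},
        ]},
    ],
}
indexes = {"apples": product_index(apples, 0), "bananas": product_index(bananas, 0)}
headline = roll_up(hierarchy, indexes)
print(headline)
```

In this toy data the simple average apple price rises from about 1.33 to 1.70 (roughly 27.5%) only because more observations came from the expensive store on the second day. The identity-key approach reports a 5% rise for apples and about 3% for the headline, which reflects the single 10% price change at store S1.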
2.0 Data Capture
Data comprising the Premise FSPI is collected in real time, on the ground by
data contributors armed with photo-enabled smartphones. The contributors
work together to complete a full sampling frame each month.
Each country’s sampling frame consists of tens to hundreds of thousands of prices
per city per month, covering a wide geographic extent and multiple store types,
including traditional retail stores as well as other local points of commerce such as
open-air markets.
2.1 Contributors: Composition, Compensation, and Training
Contributors are compensated by Premise when they submit observations that
successfully pass the quality control process. See section 3.0 for more information
on quality control.
Contributors are trained by the Premise operations team to properly enter all
relevant metadata for each price observation made through the Premise mobile
application. This ensures that each data point captured is of the highest possible
quality.
2.2 Contributors: Mobile Capture Process
Every day, a list of items is delivered to each contributor through the Premise
mobile app. This list contains tasks - the items that the contributor is required
to find and whose prices they are asked to record in the physical retail outlets
in their coverage zone. The set of tasks a contributor is required to complete
represents a small, localized slice of the full sampling frame.
In order to capture a complete sampling frame, the Premise mobile app imposes
dynamic quotas on the number of observations that can be made for each
product. This ensures that contributors are providing a consistent stream of
pricing information for each and every product that comprises a country’s FSPI.
After receiving their list, each contributor visits multiple physical retail locations
in their zone - supermarkets, local mom-and-pop shops, open-air markets, street
vendors - to find the items on their list.
When a contributor finds a specific item matching something on their list, they
will record the price of that item and any associated metadata. Furthermore,
they take a photograph, which must clearly indicate the price, brand, size, and
quantity information, to ensure that the data recorded by the contributor can be
easily verified. Section 3.0 has more information on the quality control process.
Captures are transmitted back to Premise servers where the information is
aggregated into the pricing indexes in real time.
2.3 Sampling Design
Although all contributors are assigned tasks, there remains scope for the
contributors to submit observations that are outside of their assigned tasks. This
flexibility creates a crowdsourced dimension to our data, which is useful for
spotting price trends that would otherwise be missed and discovering new
products. Aside from this crowdsourced aspect, the bulk of our data is collected
within the assigned task model.
3.0 Quality Control
Since our sampling methodology relies on non-traditional data sources and data
contributors have varying degrees of technical expertise, ensuring data integrity
is a major focus. Post-capture, on-the-ground observations are submitted to a
rigorous quality control (QC) process. The QC process is designed to eliminate
noise introduced by both accidental and deliberate contributor error (fraud). The
methodology consists of a combination of automated machine learning techniques
and input from human experts.
Premise detects anomalies automatically using statistical and machine learning
models that have been trained on human-labelled data. These models detect
deviations from historical distributions of a given metric, conditional on additional
observed meta-data (where relevant). For example, a model might establish tight
bounds on the expected price of 100g of Flank Steak at a Carrefour in central
Buenos Aires.
These models are also used to flag deviant contributor behavior, e.g. by identifying
significant auto-correlation or cross-correlation with another contributor.
Premise QC experts leverage their local knowledge to assess whether the
metadata associated with a particular observation is reasonable. In addition to
checking whether the submitted price is correct for a particular good, QC
experts also flag observations when parts of the metadata are inconsistent
with one another. For example, a price and location for an observation may each
be independently reasonable but unreasonable when considered in conjunction
with the photo of the product/location. For this reason, Premise emphasizes the
collection of photographic evidence of each product in question, in addition to
other qualitative and quantitative data points.
3.1 Outliers
Even the most well-intentioned and careful contributor makes mistakes when
capturing data. The primary cause of these errors is erroneous input: for
example, a misplaced decimal point when recording the price of an item, or the
selection of the wrong product type.
Our strategy for detecting these types of errors is based on product-specific
algorithms that find outliers within particular dimensions of a product. For
example, with respect to the size of Coca-Cola bottles, we observe an empirical
3-dimensional distribution of size-quantity-unit triples that contributors have
submitted in the past. Using this history of previously submitted
size-quantity-unit triples, we can find triples that are improbable given past
submissions. Observations with rarely observed size-quantity-unit triples are
then flagged for further inspection.
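A frequency-based check of this kind might look like the sketch below. The function name, the tuple layout `(size, quantity, unit)`, and the 2% cutoff are illustrative assumptions; the source does not state how the empirical distribution is thresholded.

```python
from collections import Counter


def flag_rare_triples(history, new_triples, min_share=0.02):
    """Return the new (size, quantity, unit) triples that are rare in history."""
    counts = Counter(history)
    total = sum(counts.values())
    flagged = []
    for triple in new_triples:
        # Share of past submissions that used exactly this triple.
        share = counts[triple] / total if total else 0.0
        if share < min_share:
            flagged.append(triple)
    return flagged


# Hypothetical history of past submissions for one product.
history = [(2.0, 1, "l")] * 60 + [(330.0, 6, "ml")] * 39 + [(20.0, 1, "l")]
incoming = [(2.0, 1, "l"), (20.0, 1, "l"), (330.0, 1, "g")]
flagged = flag_rare_triples(history, incoming)
print(flagged)
```

Here the common 2-litre bottle passes, while the triple seen only once before and the never-seen triple are both flagged for further inspection.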
To detect price outliers, we attempt to find data that are considerably different
from data that have passed human-based checks. In practice, this involves
identifying observations that fall outside the bounds within which “correct” data
are likely to fall. There are multiple dimensions by which we can assess an
observation. The obvious candidate for assessment is prices, although packaging
and location are additional dimensions that are informative. To detect outliers
based on the price dimension, we exclude observations with log prices that
are a great enough distance from the mean log price for the same product in
similar packaging. The cutoff for valid observations is based on a multiple of the
standard deviation, which incorporates the natural heterogeneity in the variance
of prices across products. This algorithm is based on the fact that prices are
typically log-normally distributed. Note that when calculating the estimates
of the mean and variance per product, we use the trimmed distribution which
excludes the left and right tails. Using the trimmed distribution neutralizes the
effect of outliers which would otherwise artificially distort the estimates of the
mean and variance.
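The price check described above can be illustrated as follows. This is a sketch under stated assumptions: the 10% trim on each tail, the cutoff of three standard deviations, and the function name are chosen for the example and are not taken from the source.

```python
import math
import statistics


def price_outliers(prices, trim=0.1, k=3.0):
    """Return prices whose log is more than k trimmed std devs from the trimmed mean."""
    logs = sorted(math.log(p) for p in prices)
    n_trim = int(len(logs) * trim)
    # Drop the left and right tails before estimating mean and spread.
    core = logs[n_trim:len(logs) - n_trim] if n_trim else logs
    mu = statistics.mean(core)
    sigma = statistics.pstdev(core)
    return [p for p in prices if abs(math.log(p) - mu) > k * sigma]


# Hypothetical prices for one product in similar packaging; 34.9 mimics a
# misplaced decimal point.
prices = [3.42, 3.45, 3.49, 3.50, 3.52, 3.55, 3.60, 3.49, 3.51, 34.9]
outliers = price_outliers(prices)
print(outliers)
```

Because the mean and standard deviation are estimated on the trimmed sample, the erroneous 34.9 does not inflate the bounds that are then used to reject it.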
Outlier detection based on this mean and standard deviation algorithm works well
with continuous variables but is not directly applicable to discrete variables such
as package type (which combines the continuously distributed size with the
discrete unit, e.g. 500ml). Instead, we use non-parametric methods when
the variable is discrete, which involves trimming infrequently observed values,
where the cutoff is determined empirically from previously human-labelled data.
Key to the outlier detection process is continual improvement or fine-tuning as
we collect more data. With relatively sparse data, it is difficult to determine
whether a given observation is a true outlier using statistical methods alone.
Therefore, when capturing data in a new country or of new products, we tend
to rely on human-based methods. As the number of human-verified observations
increases and the performance of our statistical models improves, we shift
towards a more statistical approach.
Our human QC process is based on a wisdom of crowds approach. Rather than
subjecting an observation to a complete review by a single human, we subject
each individual component of an observation to many reviews by different people.
Reducing the scope of a review task from a complete review to a simple binary
question and having multiple independent reviews both increase the accuracy
of human QC. To combine multiple reviews into a single QC outcome for an
observation, we rely on an in-house voting classification algorithm.
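The in-house voting algorithm is not described in this document. As a stand-in, the sketch below uses a simple agreement threshold over binary answers; the function names, the minimum of three reviews, the 60% agreement level, and the "escalate" outcome are all hypothetical.

```python
def combine_votes(votes, min_votes=3, min_agreement=0.6):
    """Combine binary review answers (True = looks valid) into one outcome."""
    if len(votes) < min_votes:
        return "escalate"  # not enough independent reviews yet
    share_valid = sum(votes) / len(votes)
    if share_valid >= min_agreement:
        return "pass"
    if 1.0 - share_valid >= min_agreement:
        return "fail"
    return "escalate"  # reviewers disagree too much


def observation_outcome(component_votes):
    """An observation passes only if every reviewed component passes."""
    outcomes = [combine_votes(v) for v in component_votes.values()]
    if "fail" in outcomes:
        return "fail"
    if "escalate" in outcomes:
        return "escalate"
    return "pass"


# Hypothetical binary questions asked about one observation.
reviews = {
    "price_matches_photo": [True, True, True],
    "brand_matches_photo": [True, True, False],
}
print(observation_outcome(reviews))
```

Splitting an observation into narrow yes/no questions keeps each review cheap, and requiring agreement across reviewers limits the influence of any single mistaken answer.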
3.2 Contributor Fraud
Inevitably, part of the data quality process at Premise is detecting contributor
fraud, which typically involves the intentional submission of falsified data.
Detecting contributor fraud is nontrivial since contributors committing fraud often
take exceptional care in submitting reasonable data to avoid detection. Therefore,
the statistical methods described above are less useful for detecting fraud
since fraudulent observations mimic real observations.
In our experience, the majority of fraud involves:
1. Submitting observations of different items
2. Submitting false locations/places
3. Submitting a photo that has previously been submitted
4. Altering a photo and reusing it in a new submission
5. Using images from the internet in a new submission and then
falsifying the metadata of the product to form a complete submission.
The first two of these cases represent the most common types of fraud and,
additionally, the easiest to detect. Submitted captures have to be accompanied
by a photo of the item. Using a combination of computer vision and human
based checks, it is straightforward to detect an image that is dissimilar to other
valid images of the same product. Similarly, submissions are geotagged using
the GPS capabilities of the contributor’s phone and cross-referenced against our
database of locations. Therefore, mismatches between the actual location of
capture and the required location of the task are easy to detect.
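A location cross-check of this sort could be sketched as below. The capture field names follow Appendix A (`loc_lat`, `loc_long`, `loc_accuracy`); the place record keys, the 200 m tolerance, and the reading of `loc_accuracy` as metres are assumptions for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def location_mismatch(capture, place, max_distance_m=200.0):
    """True when the capture's geotag is too far from the task's store."""
    distance = haversine_m(
        capture["loc_lat"], capture["loc_long"], place["lat"], place["long"]
    )
    # Allow extra slack for the phone's own reported uncertainty.
    allowed = max_distance_m + capture.get("loc_accuracy", 0.0)
    return distance > allowed


capture = {"loc_lat": 33.0868283, "loc_long": -117.26784, "loc_accuracy": 5.6}
same_store = {"lat": 33.0868283, "long": -117.26784}
far_store = {"lat": 33.0968283, "long": -117.26784}  # about 1.1 km north
print(location_mismatch(capture, same_store), location_mismatch(capture, far_store))
```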
The final three cases center on submitting photos that are not original
or are misleading. To prevent photo-based fraud, Premise submits each photo to
a comprehensive review that compares each photo to all previously captured
photos. This comparison also includes relevant internet images due to Premise’s
additional experience collecting data points from internet sources. The simplest
and fastest method that we use is based on “hashing” images which quickly finds
exact matches. To detect more complex photo-based fraud, we use proprietary
methods for detecting similar photos, which are immune to scale, rotation, and
crop-based changes.
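The exact-match step can be illustrated with a cryptographic digest of the image bytes. The choice of SHA-256 and the class and method names below are assumptions; the source says only that images are hashed. A byte-level digest catches identical files only, so resized, rotated, or cropped copies would need the similarity methods mentioned above.

```python
import hashlib


def image_digest(image_bytes):
    """Hex digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


class ExactDuplicateIndex:
    """Remembers which capture first submitted each distinct image."""

    def __init__(self):
        self._first_seen = {}

    def check_and_add(self, capture_id, image_bytes):
        """Return the earlier capture id if this image was seen before, else None."""
        digest = image_digest(image_bytes)
        original = self._first_seen.get(digest)
        if original is None:
            self._first_seen[digest] = capture_id
        return original


index = ExactDuplicateIndex()
first = index.check_and_add("capture-1", b"fake image bytes")
repeat = index.check_and_add("capture-2", b"fake image bytes")
other = index.check_and_add("capture-3", b"different image bytes")
print(first, repeat, other)
```

Lookups are constant-time dictionary hits, which is what makes this the fastest of the checks.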
Premise implements policies to make submitting fraudulent data increasingly
difficult relative to the benefits of a successful fraudulent contribution. For
instance, requiring a valid photo of the item constrains the range of information
that is appropriate for each submission, making data-based fraud costly in terms
of time and effort. A fraudulent submission would require the contributor to
take an original photo in store and falsify data that mimics a real product. The
cost to do so is perceived by our network members to be greater than capturing
a valid product, limiting the prevalence of this type of fraud.
In addition to automated filters, Premise has built a dedicated fraud team which
analyzes observations on a contributor-by-contributor basis and/or store-by-
store basis to detect patterns of fraud that cannot be discovered by looking at
individual observations in isolation. Premise has a zero-tolerance policy towards
contributor fraud with any verified instance of fraud resulting in a permanent
ban. Because fraud adapts to the methods used to detect it, fraud detection
and prevention procedures are constantly evolving.
4.0 References Used for Subcomponent and Elementary Product Selection
Argentina
Actualizacion Metodologica - Implementacion de Indices Encadenados En El
IPC-GBA, Instituto Nacional de Estadistica y Censos, 2011
Indice de Precios al Consumidor de La Ciudad de Buenos Aires - Principales
aspectos metodologicos, Gobierno de la Ciudad de Buenos Aires, 2013
Indice de Precios al Consumidor GBA, Instituto Nacional de Estadistica y
Censos, 2013
Brazil
Consumer Sector - Latin America Retail, ItauBBA, 2013
Estrutura de ponderação - IPCA - Julho 2013, Instituto Brasileiro de Geografia
e Estatistica, 2013
Indicadores IBGE - Sistema Nacional de Indices de Preços ao Consumidor,
Instituto Brasileiro de Geografia e Estatistica, 2012
China
Annual Report on the Consumer Price Index 2012, Census and Statistics
Department - Hong Kong Special Administrative Region, 2012
China Food Manufacturing Annual Report, United States Department of
Agriculture, 2013
Statistical Yearbook of the Republic of China, Directorate-General of Budget,
Accounting, and Statistics, Executive Yuan, Republic of China, 2011
India
Consumer Price Index Numbers - Separately for Rural and Urban Areas and Also
Combined, Ministry of Statistics and Programme Implementation - India, 2010
Household Consumption of Various Goods and Services in India, Ministry of
Statistics and Programme Implementation - India, 2012
United States
Handbook of Methods, Bureau of Labor Statistics, 2008
Relative Importance of Components in the Consumer Price Index, Bureau of
Labor Statistics, 2012
Appendix A: Premise Observational Data Schema
Each observation contains a set of metadata that provides more information
about that capture. These metadata fields are available for each observation in
the sample raw datasets on data.premise.com. They are:
| Field | Type | Example | Description |
|---|---|---|---|
| city_state | string | “San Diego” | A city-level administrative region where the capture was made. |
| country | string | “US” | The ISO 3166 code of the country where an offline item was captured. |
| created | datetime | “2013-04-16T17:31:34.000Z” | ISO 8601 timestamp of the capture. |
| currency | string | “USD” | ISO 4217 currency code for the item’s price. |
| language | string | “en” | ISO 639-2 language code. |
| loc_accuracy | double | 5.6 | (Optional) Accuracy of the location information reported by the phone. |
| loc_lat | double | 33.0868283 | (Optional) Latitude. |
| loc_long | double | -117.26784 | (Optional) Longitude. |
| local_timestamp | datetime | “2013-04-16T14:41:34.000-03:00” | ISO 8601 timestamp of the capture in local time. |
| local_timezone | datetime | “US/Pacific” | (Optional) Timezone where the capture was made. |
| mtime | datetime | “2014-06-12T13:21:34.000Z” | ISO 8601 timestamp of the most recent modification of the capture metadata. This represents the time at which the quality control systems made a modification to the capture metadata. |
| normalized_price | double | 0.00992792158 | The price of the item normalized for size and quantity. |
| normalized_size_units | string | “g” | The units of the item’s size after normalization. |
| place_name | string | “Albertsons” | (Optional) A string name for the store. |
| place_uuid | string | “780b3709-fab3-48bf-9a4e-1e81db02b33a” | A unique identifier for this store. |
| price | double | 3.49 | The observed purchase price of the item in local currency. |
| quantity | string | “1” | (Optional) Quantity of items sold in a group (e.g. 6 for 6-pack of Coca-Cola). |
| size | string | “12.4” | (Optional) Numeric component of the item’s size; e.g. “473” if the item is sold in “473ml” units. |
| size_units | string | “oz” | (Optional) Units component of the item’s size; e.g. “ml” if the item is sold in “473ml” units. |
| spec_brand | string | “Chips Ahoy” | (Optional) The brand of the spec. In Premise nomenclature, a “spec” is a product descriptor that optionally contains brand and manufacturer metadata. |
| spec_manufacturer | string | “Nabisco” | (Optional) The manufacturer of the spec. |
| spec_product | string | “Chocolate Chip Cookies” | The product being captured. |
| spec_uuid | string | “495115e1cf193baadb0504b7a87c49d450eb1db0” | A unique identifier for this spec. |
| timestamp | datetime | “2013-04-16T17:41:34.000Z” | ISO 8601 timestamp of the capture in UTC time. |
| thumbnail_0x0 | string | http://d3pg3j2bf4kfm1.cloudfront.net/0x0/1fc76a5f3d8e2ded10af05f6eb4dd693af881441 | Link to full-sized image. |
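The example values in the schema appear consistent with normalized_price being the price divided by the total size expressed in grams. The sketch below reproduces the example figure under that reading; the formula, the conversion table, and the function name are inferences made for illustration and are not documented in the source.

```python
# Assumed conversion factors to grams (hypothetical, weight units only).
GRAMS_PER_UNIT = {"g": 1.0, "kg": 1000.0, "oz": 28.3495}


def normalized_price(price, size, size_units, quantity="1"):
    """Price per gram; size and quantity arrive as strings, as in the schema."""
    grams = float(size) * float(quantity) * GRAMS_PER_UNIT[size_units]
    return price / grams


# Values taken from the example column: price 3.49, size "12.4", units "oz".
example = normalized_price(3.49, "12.4", "oz", "1")
print(round(example, 11))
```

The result agrees with the example normalized_price of 0.00992792158 and normalized_size_units of “g” to the precision shown.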