Democratization of Data
Why and how we built an internal data pipeline platform @Indix
About me
Manoj Mahalingam
Principal Engineer @Indix
Six Business Critical Indexes
People, Places, Documents, Businesses, Products, Connected Devices
Enabling businesses to build location-aware software. ~3.6 million websites use Google Maps.
Enabling businesses to build product-aware software. Indix catalogs over 2.1 billion product offers.
Indix - The “Google Maps” of Products
Data Pipeline @Indix
Crawling Pipeline (ML): Crawl Seed → Crawl Data from Brand & Retailer Websites → Parse, then Extract Attributes, Classify, Dedupe, Standardize, Match, Aggregate
Feeds Pipeline: Brand & Retailer Feeds → Feed Data → Transform, Clean, Connect
Both pipelines feed the Indix Product Catalog.
Indexing Pipeline (Real Time): Index, Analyze, Derive, Join → Search & Analytics Index
Outputs: Customizable Feeds, API (Bulk & Synchronous), Product Data Transformation Service
Democratization
of Data
Enable everyone in the organization
to know what data is available, and
then understand and work with it.
At Indix, we have
and work with a lot
of data.
Scale of Data @ Indix
2.1 Billion Product URLs
8 TB HTML Data Crawled Daily
1 Billion Unique Products
7000 Categories
120 Billion Price Points
3000 Sites
● We have data in different shapes and sizes.
● HTML pages, Thrift and Avro records.
● And also the usual suspects - CSVs and plain text data.
● Datasets can be in TBs or a few hundred KBs.
● A few billion records, or just a couple of hundred.
But... the data’s potential couldn’t be realized
Data wasn’t discoverable
● The biggest problem was in knowing what data exists and
where.
● Some of the data was in S3. Some in HDFS. Some in Google Sheets.
● There was no way to know how frequently and when the data changed or was updated.
The schema wasn’t readily
known
● The schema of the data, as expected, kept changing and it
was difficult to keep track of which version of data had
which schema.
● While Thrift and Avro alleviate this to an extent - the schema travels with the data, as sketched below - access to the data wasn’t simple, especially for non-engineers.
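To make this concrete, here is a minimal sketch - assuming Spark with the spark-avro data source on the classpath, and a hypothetical S3 path - of how an Avro dataset carries its schema along and can be introspected without a separate definition:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-introspection").getOrCreate()

// With Avro (and similarly Thrift), the schema travels with the data,
// so it can be inspected without knowing the fields up front.
val records = spark.read.format("avro").load("s3://some-bucket/offers-avro/")  // hypothetical path
records.printSchema()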
Writing code limited scope
● We use Scalding and Spark for our MR jobs. Having to code and tweak the jobs limited the scope of who could write and run them.
● “Readymade” jobs may not allow the tweaks people need, affecting productivity and increasing dependencies.
● Having to write code and ship jars hinders ad hoc data experimentation - a sketch of such a job follows below.
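For illustration, a minimal sketch - with hypothetical paths and field names - of the kind of Spark job that had to be written, built into a jar, and submitted just to answer a one-off question:

import org.apache.spark.sql.SparkSession

object AdhocPriceReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("adhoc-price-report").getOrCreate()

    // Read the (hypothetical) offers dataset from S3/HDFS
    val offers = spark.read.parquet(args(0))

    // The one-off question: average price per category
    offers.groupBy("categoryId").avg("price").write.parquet(args(1))

    spark.stop()
  }
}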
Cost control wasn’t trivial
● While data came in various sizes and shapes, what people did with it also varied - some use cases needed just a sample of the data, while others wanted aggregations over the entire dataset (the contrast is sketched below).
● It wasn’t trivial to handle all the different workloads while minimizing costs.
● There was also the problem of ad hoc jobs starving production jobs in our existing Hadoop clusters.
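A rough sketch of that contrast, with hypothetical paths and fields - a tiny sample for experimentation versus an aggregation that scans the full dataset:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("workload-contrast").getOrCreate()
val offers = spark.read.parquet("s3://some-bucket/offers/")  // hypothetical dataset

// Experimentation: a ~0.1% sample is enough, and cheap to compute
val sample = offers.sample(withReplacement = false, fraction = 0.001)

// Production analytics: a full, cluster-sized scan over the entire dataset
val pricePointsPerProduct = offers.groupBy("productId").count()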
Goals of Internal Data
Pipeline Platform
Enable easy discovery of
data.
Allow schemas to be transparent and easy to create, while also allowing introspection.
Minimal coding - have prebuilt transformations for common tasks and enable a SQL-based workflow (sketched below).
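As an example of the SQL-based workflow we were aiming for (dataset and field names are hypothetical), a dataset exposed as a table can be queried with plain SQL instead of a hand-written job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-workflow").getOrCreate()
val offers = spark.read.parquet("s3://some-bucket/offers/")  // hypothetical dataset

// Expose the dataset as a table, then query it with plain SQL
offers.createOrReplaceTempView("offers")
val topCategories = spark.sql(
  """SELECT categoryId, COUNT(*) AS offerCount
    |FROM offers
    |GROUP BY categoryId
    |ORDER BY offerCount DESC
    |LIMIT 20""".stripMargin)
topCategories.show()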
Goals of Internal Data
Pipeline Platform
UI- and wizard-based workflows to enable ANYONE in the organization to run pipelines and extract data.
Manage underlying clusters and resources transparently while optimizing for costs.
Support data experimentation as well as production / customer use cases.
MDA - Marketplace
of Datasets and
Algorithms
Tech Stack
MDA - DEMO!!!
MDA with our Data Pipeline
Pipeline stages exposed as MDA algorithms: Match, Attributes, Brand, Classify, Dedup
MDA with our Data Pipeline
Pipeline stages exposed as MDA algorithms: Match, Attributes, Brand, Classify, Dedup
Customer flow: feed data in from the customer → Enrich Data, Classify, Brand → feed output back to the customer
MDA for ML Training Data
Filter → Sample → Preprocess → Training Data
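A hedged sketch of that Filter → Sample → Preprocess flow as Spark DataFrame steps - the dataset path, fields, and sampling fraction are all hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("training-data").getOrCreate()
val products = spark.read.parquet("s3://some-bucket/products/")  // hypothetical input

val trainingData = products
  .filter(col("categoryId").isNotNull && col("title").isNotNull)          // Filter: keep usable, labelled records
  .sample(withReplacement = false, fraction = 0.05)                       // Sample: a manageable slice
  .select(lower(col("title")).as("text"), col("categoryId").as("label"))  // Preprocess: normalize features

trainingData.write.parquet("s3://some-bucket/training/classification/")   // hypothetical output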
Notebooks
// Set up the MDA client
import com.indix.holonet.core.client.SDKClient
val host = "holonet.force.io"
val port = 80
val client = SDKClient(host, port, spark)

// Create a DataFrame from any MDA dataset
val df = client.toDF("Indix", "PriceHistoryProfile")
df.show()
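Since client.toDF hands back a regular Spark DataFrame, everything else in the notebook works as usual - the dataset can be sampled, joined with other data, or registered as a temp view and queried with Spark SQL.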
Dec 2015 - Start work on MDA
Mar 2016 - First release
Jul 2016 - Lot more transforms including sampling, full Hive SQL support, and UX fixes
Late 2016 - Performance improvements, Spark and infra upgrades
Early 2017 - Complete redesign of the UI based on over a year of feedback and learnings; GraphQL for the UI
June 2017 - Ability to run pipelines in the customer’s cloud infra
Aug 2017 - First closed preview of MDA for a customer
What does the future hold?
● We are far from done - things like automatic schema inference and better caching are already planned.
● And, as per the original vision, make it fully self-serve for our customers (internal and external).
● Integration with other tools out there like Superset.
● Open source as much as possible. First cut - http://github.com/indix/sparkplug
Questions?
I blog at https://stacktoheap.com
Twitter and most other platforms @manojlds
