Traackr evaluated several NoSQL database options to store its heterogeneous, unstructured web data. Document databases were the best fit due to their flexibility to store variable length text like tweets and blog posts without predefined schemas. MongoDB was selected due to its maturity, adoption, and support for ad-hoc queries and batch processing needed by Traackr in early 2010.
MongoDB & Hadoop - Understanding Your Big Data (MongoDB)
Big Data is the evolution of supercomputing for commercial enterprise and governments. Originally the domain of companies operating at Internet scale, today Big Data connects organizations of all sizes with discovery about their patterns, and insights into their business.
But understanding the differences between the plethora of new technologies can be daunting. Graph / columnar / key value store / document are all called NoSQL, but which is best? How does Hadoop play in this ecosystem - its low cost and high efficiency have made it very popular, but how does it fit?
In this webinar, we will explore:
The full spectrum of Big Data
Hadoop and MongoDB: friends or frenemies?
Differences between Systems of Record and Systems of Engagement
MongoDB customer examples of Systems of Engagement
[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features (Andrew Liu)
Let's talk about how you can get the most out of Azure DocumentDB. In this session we will dive deep into the mechanics of DocumentDB and explain the various levers available to tune performance and scale. From partitioned collections to global databases to advanced indexing and query features - this session will equip you with the best practices and nuggets of information that will become invaluable tools in your toolbox for building blazingly fast large-scale applications.
Relational databases power most applications, but new use-cases have requirements that they are not well suited for.
That's why new approaches like graph databases are used to handle join-heavy, highly-connected and realtime aspects of your applications.
This talk compares relational and graph databases, showing similarities and important differences.
We do a hands-on, deep-dive into ease of data modeling and structural evolution, massive data import and high performance querying with Neo4j, the most popular graph database.
I demonstrate a useful tool which makes data import from existing relational databases with a non-denormalized ER-model a "one click"-experience.
The biggest challenge for people coming from a relational background is adapting some of their existing database experience to new ways of thinking.
Webinar: An Enterprise Architect’s View of MongoDB (MongoDB)
In the world of big data, legacy modernization, siloed organizations, empowered customers, and mobile devices, making informed choices about your enterprise infrastructure has become more important than ever. The alternatives are abundant, and the successful Enterprise Architect must constantly discern which new technology is just a shiny object and which will add true business value.
MongoDB is more than just a great application database for developers; it gives Enterprise Architects new capabilities to solve previously difficult architectural requirements much more easily. Take for example the challenge of many siloed systems at MetLife: with MongoDB, the MetLife team was able to provide a single view into those 70 systems in only 3 months.
In this webinar, we will:
Explore real life challenges enterprises face with case studies of their solutions
Consider how best to introduce MongoDB in the enterprise
Give an overview of how to optimize the use of MongoDB
Introducing Azure DocumentDB - NoSQL, No Problem (Andrew Liu)
Application developers support unprecedented rates of change – functionality must rapidly evolve to meet changing customer needs and to respond to competitive pressures while user populations can grow dramatically and unpredictably. To address these realities, developers are selecting document-oriented databases for schema flexibility, scalability and high performance data storage.
In this session, we will get hands on with Azure’s NoSQL document database service. Azure DocumentDB offers full indexing of JSON documents, SQL query capabilities and multi-document transactions. Learn how to get started with Azure DocumentDB and hear about some of the recent improvements to the service.
Everyone is awash in the new buzzword, Big Data, and it seems as if you can’t escape it wherever you go. But there are real companies with real use cases creating real value for their businesses by using big data. This talk will discuss some of the more compelling current or recent projects, their architecture & systems used, and successful outcomes.
Rapid Development and Performance By Transitioning from RDBMSs to MongoDB
Modern day application requirements demand rich & dynamic data structures, fast response times, easy scaling, and low TCO to match the rapidly changing customer & business requirements plus the powerful programming languages used in today's software landscape.
Traditional approaches to solutions development with RDBMSs increasingly expose the gap between the modern development languages and the relational data model, and between scaling up vs. scaling horizontally on commodity hardware. Development time is wasted as the bulk of the work has shifted from adding business features to struggling with the RDBMSs.
MongoDB, the premier NoSQL database, offers a flexible and scalable solution to focus on quickly adding business value again.
In this session, we will provide:
- Overview of MongoDB's capabilities
- Code-level exploration of the MongoDB programming model and APIs and how they transform the way developers interact with a database
- Update of the exciting features in MongoDB 3.0
Family tree of data – provenance and Neo4j (M. David Allen)
Discusses data provenance and how it can be implemented in neo4j, as well as many lessons learned about the relative strengths and weaknesses of relational and graph databases.
Big Data Architecture Workshop - Vahid Amiri (datastack)
Big Data Architecture Workshop
This slide deck covers big data tools, technologies and layers that can be used in enterprise solutions.
TopHPC Conference
2019
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes (MongoDB)
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect, Matt Kalan, sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve... (Felix Gessert)
The unprecedented scale at which data is consumed and generated today has shown a large demand for scalable data management and given rise to non-relational, distributed "NoSQL" database systems. Two central problems triggered this process: 1) vast amounts of user-generated content in modern applications and the resulting requests loads and data volumes 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer amount of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
If you'd like to use these slides for e.g. teaching, contact us at gessert at informatik.uni-hamburg.de - we'll send you the PowerPoint.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
Erik Baardse and Ajit Gadge from EDB Postgres presented on how to transform your DBMS in order to drive digital business: how Postgres enables you to support a wider range of workloads with your relational database, which opens the Big Data doors. They also cover EnterpriseDB’s strategy around Big Data, which focuses on 3 areas, and finally, last but not least, how to find money in IT with Big Data and digital transformation.
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra (Caserta)
Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. Needed is a scalable Big Data infrastructure that processes and parses extremely high volume in real-time and calculates aggregations and statistics. Banking trade data where volumes can exceed billions of messages a day is a perfect example.
Firms are fast approaching 'the wall' in terms of scalability with relational databases, and must stop imposing relational structure on analytics data and map raw trade data to a data model in low latency, preserve the mapped data to disk, and handle ad-hoc data requests for data analytics.
Joe discusses and introduces NoSQL databases, describing how they are capable of scaling far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
On Friday, September 25th Devin Hopps led us through an Introduction to Big Data and how technology has evolved to harness the power of Big Data.
Sharing a Startup’s Big Data Lessons
2. Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
3. George Stathis, VP Engineering
14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr’s growth goals.
4. What’s this talk about?
• Share what we know about Big Data/NoSQL: what’s behind the buzz words?
• Our reasons and method for picking a NoSQL database
• Share the lessons we learned going through the process
6. What is Big Data?
• 3 Vs:
– Volume
– Velocity
– Variety
7. What is Big Data? Volume + Velocity
• Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g. big science, astronomy, web logs, atmospheric science, RFID, genomics, sensor networks, biogeochemical data, social networks, military surveillance, social data, medical records, internet text and documents, photography archives, internet search indexing, video archives, call detail records, large-scale e-commerce
8. What is Big Data? Variety
• Big Data is varied and unstructured
(figure: traditional static reports vs. analytics, exploration & experimentation)
9. What is Big Data?
• Scaling data processing cost effectively
(figure: cost comparison of scaling approaches)
10. What is NoSQL?
• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations; it’s not about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; don’t have to know the exact structure before storing
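The schema-flexibility point above can be sketched with plain Python dicts standing in for documents (a hypothetical mini-collection, not any particular database’s API):

```python
# Hypothetical sketch: schema flexibility means records in the same
# collection can carry different fields -- no ALTER TABLE needed.
collection = []

# A short tweet: just a few fields.
collection.append({"type": "tweet", "author": "@alice", "text": "hello"})

# A blog post in the same collection: richer structure, new fields on the fly.
collection.append({
    "type": "blog_post",
    "author": "bob",
    "title": "On NoSQL",
    "tags": ["nosql", "databases"],
})

# Queries simply ignore fields a document does not have.
by_alice = [d for d in collection if d.get("author") == "@alice"]
print(len(by_alice))  # 1
```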
11. What is Distributed Computing?
• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
• Allows computations to be accomplished in acceptable timeframes
• Distributed computation approaches were developed to leverage multiple machines: MapReduce
• With MapReduce, the program goes to the data since the data is too big to move
13. What is MapReduce?
• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don’t outright replace warehouse solutions; they complement them
• RDBMS is still safe
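The batch model the last two slides describe can be sketched as a single-process word count (map emits pairs, a shuffle groups them by key, reduce folds each group; real frameworks distribute these phases and ship the map code to the data nodes):

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: fold all values emitted for one key.
    return (word, sum(counts))

lines = ["big data big ideas", "big data tools"]

# Shuffle phase: group all emitted values by key.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_fn(l) for l in lines):
    groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result["big"])  # 3
```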
14. What is Big Data? Velocity
• In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
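The pre-computing option can be sketched as a running aggregate that is updated per incoming event, so reads never wait on a batch job (the event shape here is a hypothetical example):

```python
from collections import Counter

# Sketch of "aggregate and count analytics as data arrives": keep running
# totals up to date per event so reads are a single key lookup.
clicks_by_product = Counter()

def on_click_event(event):
    # Called once per incoming event, e.g. by a stream consumer.
    clicks_by_product[event["product_id"]] += 1

stream = [{"product_id": "p1"}, {"product_id": "p2"}, {"product_id": "p1"}]
for event in stream:
    on_click_event(event)

print(clicks_by_product["p1"])  # 2
```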
15. What is Big Data? Data Science
• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so they can tweak them when they don’t work
• Can distill the results into an easy-to-understand story
• Help businesses gain actionable insights
20. Traackr: context
• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
55. Requirement: batch processing
• MapReduce + RDBMS: possible, but via proprietary solutions
• Usually involves exporting data from the RDBMS into a NoSQL system anyway
• Defeats the data locality benefit of MR
56. Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable length text ✓
• Batch processing options ✓
A NoSQL option is the right fit
58. Bewildering number of options (early 2010)
Key/Value Databases: distributed hashtables; designed for high load; in-memory or on-disk; eventually consistent.
Column Databases: spreadsheet-like; the key is a row id; attributes are columns; columns can be grouped into families.
Document Databases: like key/value, but the value is a document (JSON/BSON); JSON = flexible schema.
Graph Databases: graph theory G=(V,E); great for modeling networks and for graph-based query algorithms.
60. Trimming options
Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
61. Trimming options
Memcache: memory-based; we need true persistence.
62. Trimming options
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
63. Trimming options
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
64. Trimming options
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
65. Trimming options
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
66. Trimming options
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
67. Trimming options
Riak: very close, but in early 2010 we had adoption questions.
68. Trimming options
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
69. Lessons Learned
Challenges: Complexity, Missing Features, Problem/solution fit, Resources
Rewards: Choices, Empowering, Community, Cost
70. Rewards: Choices
(recaps the Key/Value, Column, Document and Graph database landscape from slide 58)
72. Lessons Learned
Challenges: Complexity, Missing Features, Problem/solution fit, Resources
Rewards: Choices, Empowering, Community, Cost
73. When Big-Data = Big Architectures
• Must have an odd number of ZooKeeper quorum nodes.
• Master/slave architecture means a single point of failure, so you need to protect your master.
• Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
• Must have a Hadoop HDFS cluster of at least 2x replication factor nodes.
• And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
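As a rough illustration of why the footprint adds up, here is a back-of-the-envelope sizing helper; the formulas encode the slide’s rules of thumb as stated, not official HBase/Hadoop requirements:

```python
# Illustrative sizing sketch based on the slide's rules of thumb:
# odd ZooKeeper quorum, HDFS cluster of at least 2x the replication
# factor, regionservers co-located with datanodes, plus a master to protect.

def min_cluster_nodes(replication_factor=3, zookeeper_quorum=3):
    if zookeeper_quorum % 2 == 0:
        raise ValueError("ZooKeeper quorum should be an odd number of nodes")
    datanodes = 2 * replication_factor  # slide's 2x replication rule of thumb
    masters = 1                         # the single point of failure to protect
    return zookeeper_quorum + datanodes + masters

print(min_cluster_nodes())  # 10
```

Even with modest defaults, that is around ten machines before any application servers, which is the slide’s point about big equipment.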
76. To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
77. What it means to a startup
(figure: development capacity before vs. development capacity after)
Congrats, you are now a sysadmin…
78. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
79. Mapping a saved search to a column store
Name
Ranks
References to influencer records
80. Mapping a saved search to a column store
“attributes” column family for general attributes
Unique key
“influencerId” column family for influencer ranks and foreign keys
81. Mapping a saved search to a column store
“name” attribute
Influencer ranks can be attribute names as well
82. Mapping a saved search to a column store
Can get pretty long so needs indexing and pagination
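The slides above describe one wide row per saved search, with an “attributes” column family for general fields and an “influencerId” family for ranks and foreign keys. A rough Python sketch of that layout (family and column names follow the slides; the data itself is invented):

```python
# One HBase-style wide row per saved search ("Alist"), keyed by its id.
# Two column families, as in the slides: "attributes" for general fields,
# "influencerId" mapping ranks to influencer foreign keys.
row_key = "alist:123"
row = {
    "attributes": {
        "name": "Tech bloggers",
    },
    "influencerId": {
        "1": "influencer:42",  # rank -> foreign key
        "2": "influencer:7",
    },
}

# Reading only the "attributes" family returns Alist info without
# loading all the influencer references.
name = row["attributes"]["name"]

# The ranked list can get long, hence the need for pagination:
def page_of_ranks(row, start, size):
    ranked = sorted(row["influencerId"].items(), key=lambda kv: int(kv[0]))
    return ranked[start:start + size]

print(name, page_of_ranks(row, 0, 1))
```

This mirrors the benefit called out in the notes: fetching Alist metadata does not force a scan of the (potentially very long) influencer reference list.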
87. Need to upgrade to HBase 0.90
• Making sure to remain on recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
91. Let’s get this straight
• HBase no longer comes with secondary
indexing out-of-the-box
• It’s been moved out of the trunk to GitHub
• Where only one other company besides us
seems to care about it
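With secondary indexing gone from the HBase trunk, the application layer has to maintain index rows itself on every write. A simplified Python sketch of that pattern (not Traackr's actual code; in HBase the data write and the index write are two separate operations, which is exactly the maintenance burden the slides complain about):

```python
# Primary "table": row key -> record. Secondary index kept by hand:
# attribute value -> set of row keys.
data = {}
name_index = {}

def put(row_key, record):
    old = data.get(row_key)
    if old is not None:  # remove the stale index entry on update
        name_index.get(old["name"], set()).discard(row_key)
    data[row_key] = record
    name_index.setdefault(record["name"], set()).add(row_key)

def find_by_name(name):
    # Without this hand-rolled index, the only option is a full scan.
    return sorted(name_index.get(name, set()))

put("influencer:42", {"name": "Arianna Huffington"})
put("influencer:7", {"name": "Shaun Donovan"})
print(find_by_name("Shaun Donovan"))
```

Every indexed attribute multiplies this bookkeeping, and any crash between the two writes leaves the index inconsistent, which is why out-of-the-box secondary indexes became a hard requirement in round 2.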
102. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
103. Cracks in the data model
huffingtonpost.com
published under
writes for
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
104. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
Content attribution logic could sometimes
mis-attribute posts because of the
duplicated data.
105. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
authored by
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
106. Fixing the cracks in the data model
Normalize the sites
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
writes for
published under
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
107. Fixing the cracks in the data model
• Normalization requires stronger
secondary indexing
• Our application layer indexing would
need revisiting…again!
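The fix described above, normalizing sites so each one is stored once and referenced by id, can be illustrated with a small Python example (record shapes are invented), which also shows why normalization pushes the need for stronger secondary indexing:

```python
# Denormalized: each influencer carries its own copy of the site, so the
# copies can drift apart (the mis-attribution problem from the slides).
denormalized = {
    "influencer:42": {"site": {"url": "huffingtonpost.com"}},
    "influencer:7":  {"site": {"url": "huffingtonpost.com"}},  # duplicate
}

# Normalized: the site is stored once and referenced by id. Answering
# "which influencers write for site X" now needs a secondary index.
sites = {"site:1": {"url": "huffingtonpost.com"}}
influencers = {
    "influencer:42": {"site_id": "site:1"},
    "influencer:7":  {"site_id": "site:1"},
}

by_site = {}  # the secondary index: site id -> influencer ids
for inf_id, rec in influencers.items():
    by_site.setdefault(rec["site_id"], []).append(inf_id)

print(sorted(by_site["site:1"]))
```

The trade is the classic one: denormalization buys fast reads at the cost of consistency, normalization buys consistency at the cost of needing indexed lookups.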
108. What it means to a startup
Psych! You are back
to writing indexing
code.
Development capacity
110. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
111. Traackr’s Datastore Requirements
(Revisited)
• Schema flexibility
• Good at storing lots of variable length text
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
112. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
113. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Nope!
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
114. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
115. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
Memcache: still no
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
116. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
Amazon SimpleDB: still no.
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
117. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
Redis and LinkedIn’s Project Voldemort: still no.
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
118. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
CouchDB: more mature but still no ad-hoc queries.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
119. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
120. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Riak: strong contender still, but adoption questions remained.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
121. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
122. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
124. What it means to a startup
Yay! I’m back!
Development capacity
125. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
126. What it means to a startup
Honestly, I thought
I’d never see you
guys again!
Development capacity
127. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
• Our NoSQL could now support our domain
model
131. Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
• Great documentation
• Great adoption and community
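To illustrate the first bullet: a MongoDB-style `find()` filter replaces a whole MapReduce job for a simple report. The toy matcher below simulates the idea in Python (collection and field names are invented; in real MongoDB the query document is evaluated server-side):

```python
# Invented sample data standing in for a MongoDB collection.
posts = [
    {"author": "arianna-huffington", "site": "huffingtonpost.com"},
    {"author": "shaun-donovan", "site": "huffingtonpost.com"},
]

def find(collection, query):
    """Return documents whose fields match every key/value in query,
    mimicking a MongoDB exact-match query document."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

result = find(posts, {"author": "shaun-donovan"})
print(len(result))
```

The equivalent HBase report would have meant writing, compiling, and deploying a Java MapReduce job for each new question asked of the data.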
134. And less of this
Source: socialbutterflyclt.com
135. Recap & Final Thoughts
• 3 Vs of Big Data:
– Volume
– Velocity
– Variety
• Big Data technologies are complementary to
SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important
136. Recap & Final Thoughts
• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless
refactoring
• Importance of ease of use and
administration cannot be overstated for a
small startup
Big science: Large Hadron Collider (LHC)Sensor networks: forest fire detectionCall detail record, a record of a (billing) event produced by a telecommunication network element
Scaling here means maintaining throughput of computation and analysis while data sizes increase: divide up the work on multiple machines
Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes
The point is that we don’t need to track the entire web: just the subset belonging to influencers!
There is a different perspective on “Web Scale” that has to do with the nature of the data on the web
Take the approach of using a simplified entity model
…with semi-structured data storage formats like JSON: facilitate capturing related attribute structures; enable the flexibility of defining new attributes as they are discovered
CLOB pre-allocated space
Sparse maps
- This is something we thought we needed back in early 2010
- Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes)
Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Memcache: memory-based,we need true persistence
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options
Riak: very close but in early 2010, we had adoption questions
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR. Hadoop and its maturity were a big reason we picked HBase.
Had to deal with a complex architecture right from the start:
- minimum number of data nodes to support replication
- odd number of ZooKeeper nodes to avoid voting deadlocks
- co-locating region servers = paying close attention to JVM resources
- Master = SPOF
- co-locating job trackers = paying close attention to JVM resources
- Quick overview of how we modeled a list in HBase => saved searches
- This is what our customers see
- Let's consider the name, the ranks of the influencers and the influencer references
Each row has a unique key: the Alist id. We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get Alist information without loading all the influencers. We would group the influencer references under another family of columns named “influencerIds”.
Now we can see where the attributes we see on the screen are stored
- We coded the pagination and indexing features ourselves and contributed them back
- Felt really good about it!
It wasn’t bad enough we had to write our own code to support our indexing needs, we now had to maintain a third-party code base that was quickly becoming outdated!
Simplified example for posts
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before
CouchDB: more mature but still no ad-hoc queries
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
Riak: strong contender still but adoption questions
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
siteId indexed for “find influencers connected to site X”
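The embedded-reference design in these notes can be sketched in Python: each influencer document embeds site references augmented with per-relationship attributes, and a lookup on siteId (served by an index in real MongoDB) answers "find influencers connected to site X". Field names follow the notes; the records themselves are invented:

```python
# Influencer documents embedding site references with relationship
# attributes such as percent contribution to content.
influencers = [
    {"_id": "influencer:42",
     "sites": [{"siteId": "site:1", "percentContribution": 80}]},
    {"_id": "influencer:7",
     "sites": [{"siteId": "site:1", "percentContribution": 20}]},
]

# In MongoDB an index on "sites.siteId" would serve this lookup;
# here we scan to show the access pattern only.
def connected_to(site_id):
    return sorted(doc["_id"] for doc in influencers
                  if any(s["siteId"] == site_id for s in doc["sites"]))

print(connected_to("site:1"))
```

Embedding keeps the relationship attributes co-located with the influencer for fast reads, while the siteId reference avoids duplicating the site record itself, the lesson learned from the earlier denormalized model.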