Traackr evaluated several NoSQL database options to store its heterogeneous, unstructured web data. Document databases were the best fit due to their flexibility to store variable length text like tweets and blog posts without predefined schemas. MongoDB was selected due to its maturity, adoption, and support for ad-hoc queries and batch processing needed by Traackr in early 2010.
MongoDB & Hadoop - Understanding Your Big Data (MongoDB)
Big Data is the evolution of supercomputing for commercial enterprise and governments. Originally the domain of companies operating at Internet scale, today Big Data connects organizations of all sizes with discovery about their patterns, and insights into their business.
But understanding the differences between the plethora of new technologies can be daunting. Graph / columnar / key value store / document are all called NoSQL, but which is best? How does Hadoop play in this ecosystem - its low cost and high efficiency have made it very popular, but how does it fit?
In this webinar, we will explore:
The full spectrum of Big Data
Hadoop and MongoDB: friends or frenemies?
Differences between Systems of Record and Systems of Engagement
MongoDB customer examples of Systems of Engagement
[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features (Andrew Liu)
Let's talk about how you can get the most out of Azure DocumentDB. In this session we will dive deep into the mechanics of DocumentDB and explain the various levers available to tune performance and scale. From partitioned collections to global databases to advanced indexing and query features - this session will equip you with the best practices and nuggets of information that will become invaluable tools in your toolbox for building blazingly fast large-scale applications.
Relational databases power most applications, but new use-cases have requirements that they are not well suited for.
That's why new approaches like graph databases are used to handle join-heavy, highly-connected and realtime aspects of your applications.
This talk compares relational and graph databases, showing similarities and important differences.
We do a hands-on, deep-dive into ease of data modeling and structural evolution, massive data import and high performance querying with Neo4j, the most popular graph database.
I demonstrate a useful tool which makes data import from existing relational databases with a non-denormalized ER-model a "one click"-experience.
The biggest challenge for people coming from a relational background is adapting some of their existing database experience to new ways of thinking.
Webinar: An Enterprise Architect’s View of MongoDB (MongoDB)
In the world of big data, legacy modernization, siloed organizations, empowered customers, and mobile devices, making informed choices about your enterprise infrastructure has become more important than ever. The alternatives are abundant, and the successful Enterprise Architect must constantly discern which new technology is just a shiny object and which will add true business value.
MongoDB is more than just a great application database for developers; it gives Enterprise Architects new capabilities to solve previously difficult architectural requirements much more easily. Take for example the challenge of many siloed systems at MetLife: with MongoDB, the MetLife team was able to provide a single view into those 70 systems in only 3 months.
In this webinar, we will:
Explore real life challenges enterprises face with case studies of their solutions
Consider how best to introduce MongoDB in the enterprise
Give an overview of how to optimize the use of MongoDB
Introducing Azure DocumentDB - NoSQL, No Problem (Andrew Liu)
Application developers support unprecedented rates of change – functionality must rapidly evolve to meet changing customer needs and to respond to competitive pressures while user populations can grow dramatically and unpredictably. To address these realities, developers are selecting document-oriented databases for schema flexibility, scalability and high performance data storage.
In this session, we will get hands on with Azure’s NoSQL document database service. Azure DocumentDB offers full indexing of JSON documents, SQL query capabilities and multi-document transactions. Learn how to get started with Azure DocumentDB and hear about some of the recent improvements to the service.
Everyone is awash in the new buzzword, Big Data, and it seems as if you can’t escape it wherever you go. But there are real companies with real use cases creating real value for their businesses by using big data. This talk will discuss some of the more compelling current or recent projects, their architecture & systems used, and successful outcomes.
Rapid Development and Performance By Transitioning from RDBMSs to MongoDB
Modern day application requirements demand rich & dynamic data structures, fast response times, easy scaling, and low TCO to match the rapidly changing customer & business requirements plus the powerful programming languages used in today's software landscape.
Traditional approaches to solutions development with RDBMSs increasingly expose the gap between the modern development languages and the relational data model, and between scaling up vs. scaling horizontally on commodity hardware. Development time is wasted as the bulk of the work has shifted from adding business features to struggling with the RDBMSs.
MongoDB, the premier NoSQL database, offers a flexible and scalable solution to focus on quickly adding business value again.
In this session, we will provide:
- Overview of MongoDB's capabilities
- Code-level exploration of the MongoDB programming model and APIs and how they transform the way developers interact with a database
- Update of the exciting features in MongoDB 3.0
Family tree of data – provenance and Neo4j (M. David Allen)
Discusses data provenance and how it can be implemented in neo4j, as well as many lessons learned about the relative strengths and weaknesses of relational and graph databases.
Big Data Architecture Workshop - Vahid Amiri (datastack)
Big Data Architecture Workshop
This slide deck covers big data tools, technologies and layers that can be used in enterprise solutions.
TopHPC Conference
2019
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes (MongoDB)
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect, Matt Kalan, sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve... (Felix Gessert)
The unprecedented scale at which data is consumed and generated today has shown a large demand for scalable data management and given rise to non-relational, distributed "NoSQL" database systems. Two central problems triggered this process: 1) vast amounts of user-generated content in modern applications and the resulting requests loads and data volumes 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer amount of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
If you'd like to use these slides for e.g. teaching, contact us at gessert at informatik.uni-hamburg.de - we'll send you the PowerPoint.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
Erik Baardse and Ajit Gadge from EDB Postgres presented on how to transform your DBMS in order to drive digital business: how Postgres enables you to support a wider range of workloads with your relational database, which opens the Big Data doors. They also cover EnterpriseDB’s strategy around Big Data, which focuses on 3 areas, and finally, last but not least, how to find money in IT with Big Data and digital transformation.
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra (Caserta)
Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. Needed is a scalable Big Data infrastructure that processes and parses extremely high volume in real-time and calculates aggregations and statistics. Banking trade data where volumes can exceed billions of messages a day is a perfect example.
Firms are fast approaching 'the wall' in terms of scalability with relational databases, and must stop imposing relational structure on analytics data and map raw trade data to a data model in low latency, preserve the mapped data to disk, and handle ad-hoc data requests for data analytics.
Joe discusses and introduces NoSQL databases, describing how they are capable of scaling far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
On Friday, September 25th Devin Hopps led us through an Introduction to Big Data and how technology has evolved to harness the power of Big Data.
Sharing a Startup’s Big Data Lessons
2. Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
3. George Stathis, VP Engineering
14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr’s growth goals.
4. What’s this talk about?
• Share what we know about Big Data/NoSQL: what’s behind the buzz words?
• Our reasons and method for picking a NoSQL database
• Share the lessons we learned going through the process
6. What is Big Data?
• 3 Vs:
– Volume
– Velocity
– Variety
7. What is Big Data? Volume + Velocity
• Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g. big science, astronomy, web logs, atmospheric science, RFID, genomics, sensor networks, biogeochemical data, social networks, military surveillance, social data, medical records, internet text and documents, photography archives, internet search indexing, video archives, call detail records, large-scale e-commerce
8. What is Big Data? Variety
• Big Data is varied and unstructured
(figure: traditional static reports vs. analytics, exploration & experimentation)
9. What is Big Data?
• Scaling data processing cost effectively
(figure: cost comparison of scaling approaches)
10. What is NoSQL?
• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations; it’s not about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; don’t have to know the exact structure before storing
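The schema-flexibility point above can be sketched with plain Python dicts standing in for documents (a hypothetical mini-collection, not any particular database’s API):

```python
# Hypothetical sketch: schema flexibility means records in the same
# collection can carry different fields -- no ALTER TABLE needed.
collection = []

# A short tweet: just a few fields.
collection.append({"type": "tweet", "author": "@alice", "text": "hello"})

# A blog post in the same collection: richer structure, new fields on the fly.
collection.append({
    "type": "blog_post",
    "author": "bob",
    "title": "On NoSQL",
    "tags": ["nosql", "databases"],
})

# Queries simply ignore fields a document does not have.
by_alice = [d for d in collection if d.get("author") == "@alice"]
print(len(by_alice))  # 1
```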
11. What is Distributed Computing?
• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
• Allows computations to be accomplished in acceptable timeframes
• Distributed computation approaches were developed to leverage multiple machines: MapReduce
• With MapReduce, the program goes to the data since the data is too big to move
13. What is MapReduce?
• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don’t outright replace warehouse solutions; they complement them
• RDBMS is still safe
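The batch model the last two slides describe can be sketched as a single-process word count (map emits pairs, a shuffle groups them by key, reduce folds each group; real frameworks distribute these phases and ship the map code to the data nodes):

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: fold all values emitted for one key.
    return (word, sum(counts))

lines = ["big data big ideas", "big data tools"]

# Shuffle phase: group all emitted values by key.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_fn(l) for l in lines):
    groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result["big"])  # 3
```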
14. What is Big Data? Velocity
• In some instances, being able to process large amounts of data in real-time can yield a competitive advantage. E.g. online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
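The pre-computing option can be sketched as a running aggregate that is updated per incoming event, so reads never wait on a batch job (the event shape here is a hypothetical example):

```python
from collections import Counter

# Sketch of "aggregate and count analytics as data arrives": keep running
# totals up to date per event so reads are a single key lookup.
clicks_by_product = Counter()

def on_click_event(event):
    # Called once per incoming event, e.g. by a stream consumer.
    clicks_by_product[event["product_id"]] += 1

stream = [{"product_id": "p1"}, {"product_id": "p2"}, {"product_id": "p1"}]
for event in stream:
    on_click_event(event)

print(clicks_by_product["p1"])  # 2
```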
15. What is Big Data? Data Science
• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so they can tweak them when they don’t work
• Can distill the results into an easy-to-understand story
• Help businesses gain actionable insights
20. Traackr: context
• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
55. Requirement: batch processing
• MapReduce + RDBMS: possible, but via proprietary solutions
• Usually involves exporting data from the RDBMS into a NoSQL system anyway
• Defeats the data locality benefit of MR
56. Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable length text ✓
• Batch processing options ✓
A NoSQL option is the right fit
58. Bewildering number of options (early 2010)
Key/Value Databases: distributed hashtables; designed for high load; in-memory or on-disk; eventually consistent.
Column Databases: spreadsheet-like; the key is a row id; attributes are columns; columns can be grouped into families.
Document Databases: like key/value, but the value is a document (JSON/BSON); JSON = flexible schema.
Graph Databases: graph theory G=(V,E); great for modeling networks and for graph-based query algorithms.
60. Trimming options
Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
61. Trimming options
Memcache: memory-based; we need true persistence.
62. Trimming options
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
63. Trimming options
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
64. Trimming options
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away, although we did try early prototypes.
65. Trimming options
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
66. Trimming options
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
67. Trimming options
Riak: very close, but in early 2010 we had adoption questions.
68. Trimming options
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
69. Lessons Learned
Challenges: Complexity, Missing Features, Problem/solution fit, Resources
Rewards: Choices, Empowering, Community, Cost
70. Rewards: Choices
(recaps the Key/Value, Column, Document and Graph database landscape from slide 58)
72. Lessons Learned
Challenges: Complexity, Missing Features, Problem/solution fit, Resources
Rewards: Choices, Empowering, Community, Cost
73. When Big-Data = Big Architectures
• Must have an odd number of ZooKeeper quorum nodes.
• Master/slave architecture means a single point of failure, so you need to protect your master.
• Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
• Must have a Hadoop HDFS cluster of at least 2x replication factor nodes.
• And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
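As a rough illustration of why the footprint adds up, here is a back-of-the-envelope sizing helper; the formulas encode the slide’s rules of thumb as stated, not official HBase/Hadoop requirements:

```python
# Illustrative sizing sketch based on the slide's rules of thumb:
# odd ZooKeeper quorum, HDFS cluster of at least 2x the replication
# factor, regionservers co-located with datanodes, plus a master to protect.

def min_cluster_nodes(replication_factor=3, zookeeper_quorum=3):
    if zookeeper_quorum % 2 == 0:
        raise ValueError("ZooKeeper quorum should be an odd number of nodes")
    datanodes = 2 * replication_factor  # slide's 2x replication rule of thumb
    masters = 1                         # the single point of failure to protect
    return zookeeper_quorum + datanodes + masters

print(min_cluster_nodes())  # 10
```

Even with modest defaults, that is around ten machines before any application servers, which is the slide’s point about big equipment.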
76. To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
77. What it means to a startup
(figure: development capacity before vs. development capacity after)
Congrats, you are now a sysadmin…
78. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
79. Mapping a saved search to a column store
Name
Ranks
References to influencer records
80. Mapping a saved search to a column store
“attributes” column family for general attributes
Unique key
“influencerId” column family for influencer ranks and foreign keys
81. Mapping a saved search to a column store
“name” attribute
Influencer ranks can be attribute names as well
82. Mapping a saved search to a column store
Can get pretty long so needs indexing and pagination
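The slides above describe one wide row per saved search, with an “attributes” column family for general fields and an “influencerId” family for ranks and foreign keys. A rough Python sketch of that layout (family and column names follow the slides; the data itself is invented):

```python
# One HBase-style wide row per saved search ("Alist"), keyed by its id.
# Two column families, as in the slides: "attributes" for general fields,
# "influencerId" mapping ranks to influencer foreign keys.
row_key = "alist:123"
row = {
    "attributes": {
        "name": "Tech bloggers",
    },
    "influencerId": {
        "1": "influencer:42",  # rank -> foreign key
        "2": "influencer:7",
    },
}

# Reading only the "attributes" family returns Alist info without
# loading all the influencer references.
name = row["attributes"]["name"]

# The ranked list can get long, hence the need for pagination:
def page_of_ranks(row, start, size):
    ranked = sorted(row["influencerId"].items(), key=lambda kv: int(kv[0]))
    return ranked[start:start + size]

print(name, page_of_ranks(row, 0, 1))
```

This mirrors the benefit called out in the notes: fetching Alist metadata does not force a scan of the (potentially very long) influencer reference list.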
87. Need to upgrade to HBase 0.90
• Making sure to remain on recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
91. Let’s get this straight
• HBase no longer comes with secondary
indexing out-of-the-box
• It’s been moved out of the trunk to GitHub
• Where only one other company besides us
seems to care about it
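With secondary indexing gone from the HBase trunk, the application layer has to maintain index rows itself on every write. A simplified Python sketch of that pattern (not Traackr's actual code; in HBase the data write and the index write are two separate operations, which is exactly the maintenance burden the slides complain about):

```python
# Primary "table": row key -> record. Secondary index kept by hand:
# attribute value -> set of row keys.
data = {}
name_index = {}

def put(row_key, record):
    old = data.get(row_key)
    if old is not None:  # remove the stale index entry on update
        name_index.get(old["name"], set()).discard(row_key)
    data[row_key] = record
    name_index.setdefault(record["name"], set()).add(row_key)

def find_by_name(name):
    # Without this hand-rolled index, the only option is a full scan.
    return sorted(name_index.get(name, set()))

put("influencer:42", {"name": "Arianna Huffington"})
put("influencer:7", {"name": "Shaun Donovan"})
print(find_by_name("Shaun Donovan"))
```

Every indexed attribute multiplies this bookkeeping, and any crash between the two writes leaves the index inconsistent, which is why out-of-the-box secondary indexes became a hard requirement in round 2.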
102. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
103. Cracks in the data model
huffingtonpost.com
published under
writes for
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
104. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
Content attribution logic could sometimes
mis-attribute posts because of the
duplicated data.
105. Cracks in the data model
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/arianna-huffington/post_1.html
authored by
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
106. Fixing the cracks in the data model
Normalize the sites
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
authored by http://www.huffingtonpost.com/arianna-huffington/post_3.html
writes for
published under
huffingtonpost.com
published under
writes for
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
authored by http://www.huffingtonpost.com/shaun-donovan/post3.html
107. Fixing the cracks in the data model
• Normalization requires stronger
secondary indexing
• Our application layer indexing would
need revisiting…again!
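The fix described above, normalizing sites so each one is stored once and referenced by id, can be illustrated with a small Python example (record shapes are invented), which also shows why normalization pushes the need for stronger secondary indexing:

```python
# Denormalized: each influencer carries its own copy of the site, so the
# copies can drift apart (the mis-attribution problem from the slides).
denormalized = {
    "influencer:42": {"site": {"url": "huffingtonpost.com"}},
    "influencer:7":  {"site": {"url": "huffingtonpost.com"}},  # duplicate
}

# Normalized: the site is stored once and referenced by id. Answering
# "which influencers write for site X" now needs a secondary index.
sites = {"site:1": {"url": "huffingtonpost.com"}}
influencers = {
    "influencer:42": {"site_id": "site:1"},
    "influencer:7":  {"site_id": "site:1"},
}

by_site = {}  # the secondary index: site id -> influencer ids
for inf_id, rec in influencers.items():
    by_site.setdefault(rec["site_id"], []).append(inf_id)

print(sorted(by_site["site:1"]))
```

The trade is the classic one: denormalization buys fast reads at the cost of consistency, normalization buys consistency at the cost of needing indexed lookups.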
108. What it means to a startup
Psych! You are back
to writing indexing
code.
Development capacity
110. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
111. Traackr’s Datastore Requirements
(Revisited)
• Schema flexibility
• Good at storing lots of variable length text
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
112. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
113. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Nope!
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
114. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
115. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
Memcache: still no
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
116. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
Amazon SimpleDB: still no.
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
117. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
Redis and LinkedIn’s Project Voldemort: still no.
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
118. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
CouchDB: more mature but still no ad-hoc queries.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
119. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
120. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
Riak: strong contender still, but adoption questions remained.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
121. NoSQL picking – Round 2 (mid 2011)
Key/Value Databases Column Databases
• Distributed hashtables • Spreadsheet-like
• Designed for high load • Key is a row id
• In-memory or on-disk • Attributes are columns
• Eventually consistent • Columns can be grouped
into families
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Document Databases Graph Databases
• Like Key/Value • Graph Theory G=(E,V)
• Value = Document • Great for modeling
• Document = JSON/BSON networks
• JSON = Flexible Schema • Great for graph-based
query algorithms
122. Lessons Learned
Challenges Rewards
- Complexity - Choices
- Missing Features - Empowering
- Problem solution fit - Community
- Resources - Cost
124. What it means to a startup
Yay! I’m back!
Development capacity
125. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
126. What it means to a startup
Honestly, I thought
I’d never see you
guys again!
Development capacity
127. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
• Our NoSQL could now support our domain
model
131. Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
• Great documentation
• Great adoption and community
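To illustrate the first bullet: a MongoDB-style `find()` filter replaces a whole MapReduce job for a simple report. The toy matcher below simulates the idea in Python (collection and field names are invented; in real MongoDB the query document is evaluated server-side):

```python
# Invented sample data standing in for a MongoDB collection.
posts = [
    {"author": "arianna-huffington", "site": "huffingtonpost.com"},
    {"author": "shaun-donovan", "site": "huffingtonpost.com"},
]

def find(collection, query):
    """Return documents whose fields match every key/value in query,
    mimicking a MongoDB exact-match query document."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

result = find(posts, {"author": "shaun-donovan"})
print(len(result))
```

The equivalent HBase report would have meant writing, compiling, and deploying a Java MapReduce job for each new question asked of the data.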
134. And less of this
Source: socialbutterflyclt.com
135. Recap & Final Thoughts
• 3 Vs of Big Data:
– Volume
– Velocity
– Variety
• Big Data technologies are complementary to
SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important
136. Recap & Final Thoughts
• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless
refactoring
• Importance of ease of use and
administration cannot be overstated for a
small startup
Big science: Large Hadron Collider (LHC)Sensor networks: forest fire detectionCall detail record, a record of a (billing) event produced by a telecommunication network element
Scaling here means maintaining throughput of computation and analysis while data sizes increase: divide up the work on multiple machines
Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes
The point is that we don’t need to track the entire web: just the subset belonging to influencers!
There is a different perspective on “Web Scale” that has to do with the nature of the data on the web
Take the approach of using a simplified entity model
…with semi-structured data storage formats like JSON: facilitate capturing related attribute structures; enable the flexibility of defining new attributes as they are discovered
CLOB pre-allocated space
Sparse maps
- This is something we thought we needed back in early 2010
- Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes)
Graph Databases: while we can model our domain as a graph we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Memcache: memory-based,we need true persistence
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches
CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options
Riak: very close but in early 2010, we had adoption questions
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR. Hadoop and its maturity were a big reason we picked HBase.
Had to deal with a complex architecture right from the start:
- minimum number of data nodes to support replication
- odd number of ZooKeeper nodes to avoid voting deadlocks
- co-locating region servers = paying close attention to JVM resources
- Master = SPOF
- co-locating job trackers = paying close attention to JVM resources
- Quick overview of how we modeled a list in HBase => saved searches
- This is what our customers see
- Let's consider the name, the ranks of the influencers and the influencer references
Each row has a unique key: the Alist id. We would group general attributes under one family of columns appropriately named “attributes”. Benefit: can get Alist information without loading all the influencers. We would group the influencer references under another family of columns named “influencerIds”.
Now we can see where the attributes we see on the screen are stored
- We coded the pagination and indexing features ourselves and contributed them back
- Felt really good about it!
It wasn’t bad enough we had to write our own code to support our indexing needs, we now had to maintain a third-party code base that was quickly becoming outdated!
Simplified example for posts
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before
CouchDB: more mature but still no ad-hoc queries
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
Riak: strong contender still but adoption questions
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
siteId indexed for “find influencers connected to site X”
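The embedded-reference design in these notes can be sketched in Python: each influencer document embeds site references augmented with per-relationship attributes, and a lookup on siteId (served by an index in real MongoDB) answers "find influencers connected to site X". Field names follow the notes; the records themselves are invented:

```python
# Influencer documents embedding site references with relationship
# attributes such as percent contribution to content.
influencers = [
    {"_id": "influencer:42",
     "sites": [{"siteId": "site:1", "percentContribution": 80}]},
    {"_id": "influencer:7",
     "sites": [{"siteId": "site:1", "percentContribution": 20}]},
]

# In MongoDB an index on "sites.siteId" would serve this lookup;
# here we scan to show the access pattern only.
def connected_to(site_id):
    return sorted(doc["_id"] for doc in influencers
                  if any(s["siteId"] == site_id for s in doc["sites"]))

print(connected_to("site:1"))
```

Embedding keeps the relationship attributes co-located with the influencer for fast reads, while the siteId reference avoids duplicating the site record itself, the lesson learned from the earlier denormalized model.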