Big Data Platforms: An Overview
1. Big Data Platforms: An Overview
C. Scyphers, Chief Technical Architect, Daemon Consulting, LLC
2. What Is “Big Data”?
• Big Data is not simply a huge pile of information
• A good starting place is the following thought:
“Big Data describes datasets so large they become very difficult to manage with traditional database tools.”
3. What Is A Big Data Platform?
Putting it simply, it is any platform which supports those kinds of large datasets.
15. NoSQL Does Not Mean “SQL Is Bad”
When the trend was just starting, “NoSQL” was coined. It’s unfortunate, because it implies antagonism towards SQL.
16. NoSQL Means “Not Only SQL”
[Diagram: relational and non-relational stores shown as overlapping sets]
NoSQL is a complement to a traditional RDBMS, not necessarily a replacement for it.
26. NoSQL is based upon a better understanding of the trade-offs in data storage, usually referred to as the “CAP Theorem”
27. The CAP Theorem
Grossly simplified (with apologies to Brewer), a database can be:
• Consistent (all clients see the same data)
• Available (all clients can find some available node)
• Partition-Tolerant (the database will continue to function even if split into disconnected sets – e.g. by a network disruption)
Pick any two.
28. CAP In Practice
• Consistent & Available (no Partition Tolerance)
• Either single machines or single-site clusters
• Typically uses two-phase commits
29. CAP In Practice
• Consistent & Partition-Tolerant (no Availability)
• Some data may be inaccessible, but the remainder is available and consistent
• Sharding is an example of this implementation
[Diagram: customer records split into shards A–F, G–R, and S–Z]
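The range-sharded customer example on this slide can be sketched in a few lines of Python. This is an illustrative toy, not code from the deck; the shard names and the routing function are hypothetical.

```python
# Hypothetical sketch of the slide's range sharding: customer records
# split across three shards by last-name initial.
SHARD_RANGES = {
    "shard-1": ("A", "F"),   # Customers A-F
    "shard-2": ("G", "R"),   # Customers G-R
    "shard-3": ("S", "Z"),   # Customers S-Z
}

def route(last_name: str) -> str:
    """Pick the shard whose letter range covers the customer's initial."""
    initial = last_name[0].upper()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= initial <= hi:
            return shard
    raise KeyError(f"no shard covers initial {initial!r}")

# If shard-2 becomes unreachable (a partition), customers G-R are
# inaccessible, but shard-1 and shard-3 remain available and consistent.
shard = route("Smith")
```

This is exactly the C-P trade: each shard stays consistent, but a lost shard means some keys simply cannot be served.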
30. CAP In Practice
• Available & Partition-Tolerant (no Consistency)
• Some data may be inaccurate; a conflict resolution strategy is required
• DNS is an example of this, as well as standard master-slave replication
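One common conflict resolution strategy for A-P systems is last-write-wins, which can be sketched as follows. The function name, timestamps, and values are illustrative assumptions, not part of the deck.

```python
# Illustrative last-write-wins conflict resolution, one common strategy
# when replicas diverge during a partition. Names/data are hypothetical.
def resolve(replica_a, replica_b):
    """Each replica holds a (timestamp, value) pair; keep the newer write."""
    return replica_a if replica_a[0] >= replica_b[0] else replica_b

# Two replicas accepted different writes while partitioned:
a = (1700000001, "100 Century Dr.")
b = (1700000005, "16 Kozyak Street")
winner = resolve(a, b)   # once the partition heals, the later write wins
```

Last-write-wins is simple but silently discards the losing write; other systems instead surface both versions to the client, as Riak does later in this deck.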
31. CAP From A Vendor POV
• C-A (no P) – this is generally how most RDBMS vendors operate
• C-P (no A) – this is how many RDBMSs attempt to address scale without incurring large costs
• A-P (no C) – this is how most NoSQL approaches solve the problem
32. ACID vs BASE
Traditional databases are ACID compliant; NoSQL databases tend to be BASE compliant.
ACID:
• Atomicity – either the entire transaction completes or none of it does
• Consistent – any transaction will take the database from one consistent state to another, with no broken constraints
• Isolation – changes do not affect other users until committed
• Durability – committed transactions can be recovered in case of system failure
BASE:
• Basically Available
• Scalable
• Eventually consistent
Eventually consistent is the key phrase here.
42. NoSQL Pros/Cons
Pros:
• Schema evolution
• Horizontal scalability
• Simple protocols
Cons:
• Querying the data is much harder
• Paradigm shift
• Security is a big issue
• May or may not support data types (BLOBs, spatial)
• Generally, uniqueness cannot be enforced
43. A Disclaimer Before We Continue
• I am not an expert on every possible Big Data platform
• There are hundreds of them; these are the ones I consider the leaders in the field and recommend
• If you have a favorite, please let me know and I’ll update the deck for next time
• The internal details on how these systems work are rather complex; I would prefer to take those questions offline
44. Flavors Of NoSQL
The four major divisions of NoSQL are:
• Key-Value
• Document Store
• Columnar
• Other
45. Key-Value
• At a very high level, key-value works essentially by pairing an index token (a key) with a data element (a value).
• Both the index token and the data value can be of any structure.
• Such a pairing is arbitrary and up to the developer of the system to determine.
46. A Key-Value Example
“John Smith”, “100 Century Dr. Alexandria VA 22304”
“John Doe”, “16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria”
In both examples, the key is a name and the value is an address. However, the structure of the address differs between the two.
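The slide's example can be made concrete with a minimal in-memory store. This is a didactic sketch of the key-value idea, not any particular product's API.

```python
# Minimal in-memory key-value store sketch: arbitrary keys mapped to
# opaque values, as in the slide's two address examples.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # the value is opaque to the store

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("John Smith", "100 Century Dr. Alexandria VA 22304")
store.put("John Doe", "16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria")
# The store neither knows nor cares that the two address formats differ;
# interpreting the value is entirely the application's job.
```

That opacity is both the strength (schema evolution, simple protocols) and the weakness (no querying inside the value) listed in the pro/con slides.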
47. Document Store
• Document stores extend the key-value paradigm into values with multiple attributes.
• The document values tend to be semi-structured data (XML, JSON, et al.) but can also be Word or PDF documents.
48. A Document Store Example
“John Smith”,
<address>
  <street>100 Century Dr.</street>
  <city>Alexandria</city>
  <state>VA</state>
  <postalCode>22304</postalCode>
</address>
“John Doe”,
{
  "address": {
    "street": "16 Kozyak Street",
    "district": "Lozenets, 1408",
    "city": "Sofia",
    "country": "Bulgaria"
  }
}
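What separates a document store from plain key-value is that the engine can see inside the value. A sketch of that idea, using the JSON document above; the `lookup` helper and its dotted-path syntax are illustrative inventions, not a real product's API.

```python
import json

# Document-store sketch: the value is semi-structured JSON the engine
# can reach into, unlike an opaque key-value blob.
doc = json.loads("""
{
  "address": {
    "street": "16 Kozyak Street",
    "district": "Lozenets, 1408",
    "city": "Sofia",
    "country": "Bulgaria"
  }
}
""")

def lookup(document, path):
    """Walk a dotted path like 'address.city' through nested attributes."""
    for part in path.split("."):
        document = document[part]
    return document

city = lookup(doc, "address.city")   # a document store can index this field
```

Being able to address individual attributes is what makes the "multiple indexes" pro on the later document-store slide possible.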
49. Columnar Family
• Usually has “rows” and “columns” – or, at least, their logical equivalents
• Not a traditional, “pure” column store
• More of a hybridized approach leveraging key-value pairs
• A key with many values attached
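The "key with many values attached" shape can be sketched as nested maps: row key, then column family, then column. The row keys, families, and data below are hypothetical, but the layout mirrors the hybrid row/column model this slide (and the BigTable slide later) describes.

```python
# Sketch of a column-family layout: a row key maps to column families,
# each holding many (column, value) pairs -- "a key with many values".
rows = {
    "john.smith": {
        "address": {"street": "100 Century Dr.", "city": "Alexandria"},
        "contact": {"email": "jsmith@example.com"},
    },
    "john.doe": {
        "address": {"city": "Sofia", "country": "Bulgaria"},
    },
}

def get(row_key, family, column):
    """Read one cell; note that rows need not share the same columns."""
    return rows[row_key][family].get(column)

# Reading a narrow slice of columns across many rows is the cheap case;
# touching every column of every row is where columnar stores struggle.
cities = [get(r, "address", "city") for r in rows]
```

Sparse rows come for free: `john.doe` has no `street` cell, and no storage is wasted representing its absence.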
52. Key-Value Pro/Con
Pros:
• Schema evolution
• Horizontal scalability
• Simple protocols
• Works well for volatile data
• High throughput, typically optimized for reads or writes
• Keys become meaningful rather than arbitrary
• Application logic defines object model
Cons:
• Packing & unpacking each key
• Keys typically are not related to each other
• The entire value must be returned, not just a part of it
• Security tends to be an issue
• Hard to support reporting, analytics, aggregation or ordered values
• Generally does not support updates in place
• Application logic defines object model
53. Where Did Key-Value Come From?
The concept is quite old, but most people trace the lineage back to Amazon and the Dynamo paper.
54. Dynamo
Amazon devised the Dynamo engine as a way to address their scalability issues in a reliable way.
• Communication between nodes is peer-to-peer (P2P)
• Replication occurs with the end client addressing conflict resolution
• Quorum reads/writes
• Always writable (hinted handoff)
• Eventually consistent
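The "quorum reads/writes" bullet rests on a simple arithmetic rule that can be written down directly. The concrete replica counts below are illustrative examples, not figures from the deck.

```python
# Sketch of the quorum rule used by Dynamo-style systems: with N
# replicas, a write acknowledged by W nodes and a read consulting R
# nodes must share at least one replica whenever R + W > N, so reads
# observe the latest acknowledged write.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

N = 3
strict = quorums_overlap(N, r=2, w=2)   # overlapping: reads stay fresh
loose = quorums_overlap(N, r=1, w=1)    # faster, but reads can be stale
```

Tuning R and W per operation is exactly the "consistency level tunable" feature the Riak slide advertises below.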
55. Eventually Consistent
• Rather than expending the runtime resources to ensure that all nodes are aware of a change before continuing, Dynamo uses an eventually consistent model.
• In this model, a subset of nodes are changed.
• Those nodes then inform their neighbors until all nodes are changed (grossly simplifying).
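The neighbor-informs-neighbor propagation can be simulated in a few lines. This is a toy model under simplifying assumptions (one gossip message per updated node per round, fixed random seed); it is not how Dynamo is actually implemented.

```python
import random

# Toy simulation of eventual consistency: a write lands on one node,
# then each round every updated node gossips to one random node until
# all nodes have the change. Grossly simplified, like the slide.
def rounds_to_converge(num_nodes: int, seed: int = 42) -> int:
    rng = random.Random(seed)
    updated = {0}                    # the write initially hits node 0
    rounds = 0
    while len(updated) < num_nodes and rounds < 1000:
        for node in list(updated):
            updated.add(rng.randrange(num_nodes))  # tell a random peer
        rounds += 1
    return rounds

# Convergence takes a handful of rounds rather than happening at write
# time; clients reading in between may see stale data.
rounds = rounds_to_converge(16)
```

The gap between the write completing and the last node learning of it is the window in which "eventual" consistency differs from the ACID kind.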
56. Can I Use Dynamo?
No. It’s an Amazon-only internal product. However, AWS S3 is largely based upon it.
Amazon did announce a DynamoDB offering for their AWS customers. While it’s probably the same, I cannot guarantee that it is.
57. Riak
• Riak is a key-value database largely modeled after the Dynamo model.
• Open source (free) with paid support from Basho.
• Main claims to fame:
• Extreme reliability
• Performance speed
58. Riak Pro/Con
Pros:
• All nodes are equal – no single point of failure
• Horizontal scalability
• Full-text search
• RESTful interface (and HTTP)
• Consistency level tunable on each operation
• Secondary indexes available
• Map/Reduce (JavaScript & Erlang only)
Cons:
• Not meant for small, discrete and numerous datapoints
• Getting data in is great; getting it out, not so much
• Security is non-existent: “Riak assumes the internal environment is trusted”
• Conflict resolution can bubble up to the client if not careful
• Erlang is fast, but it’s got a serious learning curve
60. Redis
• Redis is a key-value in-memory datastore.
• Open source (free) with support from the community.
• Main claims to fame:
• Fast. So very, very fast.
• Transactional support
• Best for rapidly changing data
61. Redis Pro/Con
Pros:
• Transactional support
• Blob storage
• Support for sets, lists and sorted sets
• Support for publish-subscribe (pub-sub) messaging
• Robust set of operators
Cons:
• Entirely in memory
• Master-slave replication (instead of master-master)
• Security is non-existent: designed to be used in trusted environments
• Does not support encryption
• Support can be hard to find
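The sorted-set structure mentioned in the pros is what sets Redis apart from a plain key-value store: members carry a numeric score and stay ordered by it. A plain-Python emulation of the idea (the real redis-py client talks to a running server, so this sketch uses a dict instead; the leaderboard data is made up):

```python
# Emulation of a Redis-style sorted set: members ordered by a numeric
# score. Illustrative only -- real Redis ZADD/ZRANGE run server-side.
leaderboard = {}

def zadd(member_scores):
    """Insert or update members with their scores (member -> score)."""
    leaderboard.update(member_scores)

def zrange_by_rank(start, stop):
    """Return members from rank `start` to `stop`, lowest score first."""
    ordered = sorted(leaderboard, key=leaderboard.get)
    return ordered[start:stop + 1]

zadd({"alice": 300, "bob": 150, "carol": 225})
lowest_two = zrange_by_rank(0, 1)    # the two members with lowest scores
```

Keeping the ordering inside the store is what makes Redis a fit for the "rapidly changing data" use case on the previous slide, such as live leaderboards.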
63. Voldemort
• Voldemort is a key-value in-memory database built by LinkedIn.
• Open source (free) with support from the community.
• Main claims to fame:
• Low latency
• Highly available
• Very fast reads
64. Voldemort Pro/Con
Pros:
• Highly customizable – each layer of the stack can be replaced as needed
• Data elements are versioned during changes
• All nodes are independent – no single point of failure
• Very, very fast reads
Cons:
• Versioning means lots of disk space being used
• Does not support range queries
• No complex query filters
• All joins must be done in code
• No foreign key constraints
• No triggers
• Support can be hard to find
67. Document Store Recap
Document stores store an index token with a grouping of attributes in a semi-structured document.
68. Document Store Pro/Con
Pros:
• Tends to support a more complex data model than key/value
• Good at content management
• Usually supports multiple indexes
• Schemaless (can be nested)
• Typically low-latency reads
• Application logic defines object model
Cons:
• The entire value must be returned, not just a part of it
• Security tends to be an issue
• Joins are not available within the database
• No foreign keys
• Application logic defines object model
69. CouchDB
• CouchDB is a document store database.
• Open source (free), part of the Apache foundation, with paid support available from several vendors.
• Main claims to fame:
• Simple and easy to use
• Good read consistency
• Master-master replication
70. CouchDB Pro/Con
Pros:
• Very simple API for development
• MVCC support for read consistency
• Full Map/Reduce support
• Data is versioned
• Secondary indexes supported
• Some security support
• RESTful API, JSON support
• Materialized views with incremental update support
Cons:
• The simple API for development is somewhat limited
• No foreign keys
• Conflict resolution devolves to the application
• Versioning requires extensive disk space
• Versioning places large load on I/O channels
• Replication for performance, not availability
72. MongoDB
• MongoDB is a document store database.
• Open source (free) with paid support available from 10Gen.
• Main claims to fame:
• Index anything
• Ad hoc query support
• SQL-like operations (not SQL syntax)
73. MongoDB Pro/Con
Pros:
• Auto-sharding
• Auto-failover
• Update in place
• Spatial index support
• Ad hoc query support
• Any field in Mongo can be indexed
• Very, very popular (lots of production deployments)
• Very easy transition from SQL
Cons:
• Does not support JSON: BSON instead
• Master-slave replication
• Has had some growing pains (e.g. Foursquare outage)
• Not RESTful by default
• Failures require a manual database repair operation (similar to MySQL)
• Replication for availability, not performance
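"SQL like operations, not SQL syntax" means queries are expressed as documents themselves, with operators such as `$gt`. A toy matcher showing the flavor of a Mongo-style ad hoc query — this is an illustration, not the pymongo driver:

```python
# Toy Mongo-style query matcher: a filter document is compared against
# each record, supporting exact match and one operator ($gt) as a sample.
def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):        # operator form, e.g. {"$gt": 30}
            if "$gt" in cond and not doc.get(field, float("-inf")) > cond["$gt"]:
                return False
        elif doc.get(field) != cond:      # exact-match form
            return False
    return True

users = [{"name": "ann", "age": 31},
         {"name": "bob", "age": 25},
         {"name": "cal", "age": 40}]

# Roughly: SELECT name FROM users WHERE age > 30
found = [u["name"] for u in users if matches(u, {"age": {"$gt": 30}})]
print(found)   # ['ann', 'cal']
```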
76. Columnar Family Recap
• A key with many values attached
• Usually presenting as “rows” and “columns”
• Or, at least, their logical equivalents
77. Columnar Pro/Con
Pros:
• Tend to have some level of rudimentary security support
• Usually include a degree of versioning
• Can be more efficient than row databases when processing a limited number of columns over a large amount of rows
Cons:
• Much less efficient when processing many columns simultaneously
• Joins tend to not be supported
• Referential integrity not available
78. Where Did Columnar Come From?
The concept has been around for a while, but most
people trace the NoSQL lineage back to Google.
79. BigTable
Google devised the BigTable engine as a way to
address their search related scalability issues in a
reliable way.
• Data is organized through a set of keys:
• Row • Column • Timestamp
• A hybrid row/column store with a single master
• Versioning is handled through the time key
• Tablets are a dynamic partition of a sequence of
rows – supports very efficient range scans
• Columns can be grouped into column families
• Column families can have access control
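The BigTable model above — values addressed by row, column, and timestamp, with rows kept in sorted order so a tablet can serve efficient range scans — can be sketched in a few lines. This is an illustration of the data model, not Google's implementation:

```python
# Sketch of the BigTable data model: (row, column) -> versioned cell,
# where versioning is handled through the time key and rows sort lexically.
table = {}   # (row, column) -> list of (timestamp, value), newest first

def put(row, column, timestamp, value):
    cell = table.setdefault((row, column), [])
    cell.append((timestamp, value))
    cell.sort(reverse=True)                  # newest version first

def get(row, column):
    return table[(row, column)][0][1]        # latest version wins

def range_scan(start_row, end_row):
    # Rows are stored sorted, so a tablet can answer this efficiently.
    return [k for k in sorted(table) if start_row <= k[0] < end_row]

put("com.example/a", "anchor:text", 1, "old")
put("com.example/a", "anchor:text", 2, "new")   # versioned via the time key
put("com.example/b", "anchor:text", 1, "other")
print(get("com.example/a", "anchor:text"))       # 'new'
print(range_scan("com.example/a", "com.example/b"))
```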
80. Can I Use BigTable?
No. It’s a Google only internal product. However,
quite a few open source products are built upon
the concepts.
81. Cassandra
• Cassandra is a hybrid: the BigTable data model built on Dynamo infrastructure
• Open source (free), built by Facebook with paid
support available from several vendors.
• Main claims to fame:
• An Apache project
• Very, very fast writes
• Spans multiple datacenters
82. Cassandra Pro/Con
Pros:
• Designed to span multiple datacenters
• Peer to peer communication between nodes
• No single point of failure
• Always writeable
• Consistency level is tunable at run time
• Supports secondary indexes
• Supports Map/Reduce
• Supports range queries
Cons:
• No joins
• No referential integrity
• Written in Java – quite complex to administer and configure
• Last update wins
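"Consistency level is tunable at run time" boils down to quorum arithmetic: with replication factor N, a read touching R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. A minimal sketch of that rule:

```python
# Cassandra-style tunable consistency: R + W > N guarantees a read
# overlaps the latest acknowledged write; otherwise reads are eventual.
def is_strongly_consistent(n, r, w):
    return r + w > n

N = 3  # replication factor
print(is_strongly_consistent(N, r=1, w=1))   # False: ONE/ONE, fast but eventual
print(is_strongly_consistent(N, r=2, w=2))   # True: QUORUM reads and writes
print(is_strongly_consistent(N, r=1, w=3))   # True: write ALL, read ONE
```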
84. HBase
• HBase is a columnar database built on top of the
Hadoop environment.
• Open source (free) with paid support from
numerous vendors
• Main claims to fame:
• Ad hoc type abilities
• Easy integration with
Map/Reduce
85. HBase Pro/Con
Pros:
• Map/Reduce support
• More of a CA approach than an AP
• Supports predicate push down for performance gains
• Automatic partitioning and rebalancing of regions
• Data is stored in a sorted order (not indexed)
• RESTful API
• Strong and vibrant ecosystem
Cons:
• Secondary indexes generally not supported
• Security is non-existent
• Requires a Hadoop infrastructure to function
87. Hadoop
• Hadoop is not a columnar store as such.
• Rather, Hadoop is a massively parallel data
processing engine
• Main claims to fame:
• Specializes in unstructured data
• Very flexible and popular
88. Hadoop Pro/Con
Pros:
• While written in Java, almost any language can leverage Hadoop
• Runs on commodity servers
• Horizontally scalable
• Very fast and powerful
• Where Map/Reduce originated
• Ample support from vendors
• “Helper” languages like Hive and Pig
• Strong and vibrant ecosystem
Cons:
• Large amounts of disk space and bandwidth required
• Paradigm shift for IT staff
• Quality talent is highly in demand and expensive
• Security is non-existent
• Name node is a single point of failure
• More or less only supporting batch processing
• Not user friendly to anyone other than developers
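The canonical Map/Reduce example Hadoop popularized is word count. A single-process sketch of the two phases — on a real cluster, map and reduce tasks run in parallel across many nodes:

```python
from collections import defaultdict

# Word count as map + reduce. The shuffle step (grouping pairs by key)
# is folded into the reduce phase here for brevity.
def map_phase(lines):
    for line in lines:                 # mapper: emit (word, 1) per word
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    totals = defaultdict(int)          # reducer: sum counts per word
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big deal", "data wins"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'deal': 1, 'wins': 1}
```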
90. Columnar “Big Vendor”
• EMC Greenplum
• Teradata Aster
Insofar as both of these solutions graft Map/Reduce into a (more or less) SQL environment
91. Which One Do I Use Where?
• Key-Value for (relatively) simple, volatile data
• Document store for more complex data
• Columnar for analytical processing
• RDBMS for traditional processing – particularly where lazy consistency is not acceptable
• Point Of Sale, for example
93. @scyphers
Additional Information At
http://www.daemonconsulting.net/BDC-FOSE-2012
Daemon Consulting, LLC
http://www.daemonconsulting.net/
Specializing In The Hard Stuff