Hybrid systems that integrate MapReduce and an RDBMS aim to combine the best of both worlds. In-database MapReduce systems such as Greenplum and HadoopDB run MapReduce programs directly on relational data, gaining high performance while leveraging existing RDBMS features such as SQL, security, backup/recovery, and analytics tools. File-only systems such as Pig and Hive are easier for developers but integrate less with RDBMS functionality. Overall, the relationship between MapReduce and RDBMSs continues to evolve as each aims to address the other's limitations.
1. http://www.coordguru.com
MapReduce Debates and Schema-Free
- Big Data, MapReduce, RDBMS+MapReduce, Non-Relational DB
Woohyun Kim
The creator of open source “Coord”
(http://www.coordguru.com)
2010-03-03
3.
Noah’s Ark Problem
• Did Noah take dinosaurs on the Ark?
  • The Ark was a very large ship designed especially for its important purpose
  • It was so large and complex that it took Noah 120 years to build
• How to put such a big thing on board?
  • Diet or DNA?
  • Differentiate, Put, and Integrate
  • Larger? More?
• The “Big Data” problem is just like that
  • Compression or Reduction: gzip, Fingerprint, DNA, MD5, …
  • Scale Up
  • Scale Out
4.
Perspectives of Big Data
• Store: SAN; HDFS; HBase, Voldemort, MongoDB, Cassandra
• Process: SQL; MapReduce; Pig; Hive, CloudBase; HadoopDB
• Analyze: OLAP; Text/Data Mining; Social/Semantic Analysis; Visualization; Reporting
• Retrieve: SQL; MapReduce; Key-Value; RESTful
6.
Case Study: User Credit Analysis
A User Credit Model
[Diagram: a weighted aggregation tree. User Credit is a weighted sum (0.5 each) of an amount score and a quality score. Quality aggregates features including Open100_write_cnt, Answer_cnt, Question_cnt, confidence, and popularity. Confidence and popularity each combine negative and positive sub-scores (with weights such as -0.5/0.5 and -0.2/0.8) built from counts including Penalty_cnt, Admin_delete_cnt, Report_cnt, best_answer_cnt, Total_kinup_point, Aha_best_cnt, Is_honor, Dredt_level, and Is_sponsor. The input features are produced by an ETL step.]
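The credit model above is essentially a tree of weighted sums. A minimal sketch in Python, using node names from the slide; all weights and leaf values here are illustrative, not the actual model parameters:

```python
def node(weighted_children):
    """A model node's score is the weighted sum of its children's scores."""
    return sum(w * v for w, v in weighted_children)

# Hypothetical normalized leaf scores for one user (illustrative values).
leaves = {"answer_cnt": 0.6, "question_cnt": 0.4,
          "confidence": 0.7, "popularity": 0.5}

# Intermediate "quality" node, mirroring the tree's structure.
quality = node([(0.3, leaves["answer_cnt"]),
                (0.1, leaves["question_cnt"]),
                (0.3, leaves["confidence"]),
                (0.6, leaves["popularity"])])

amount = 0.8  # hypothetical activity-volume score
user_credit = node([(0.5, amount), (0.5, quality)])
```

Each preprocessing job in the next slide exists to produce one of these leaf counts.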
7.
Case Study: User Credit Analysis
Preprocessing Blog Data for Analyzing User Credit
[Diagram: a preprocessing pipeline over blog data. make_blog_post_info.cpp combines pt_log1.csv and pt_attachfile1.csv into Post/Attachment records; att_pt_log.cpp, cal_buddy_cnt.cpp (pt_buddy.csv), att_visit_count.cpp (pt_count.csv), att_is_powerblogger.cpp (pt_power_blog1.csv), and att_commenting.cpp (pt_comment1.csv) then successively attach Buddy, Count, PowerBlogger, and Comment information, yielding the final Blogger dataset.]
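One step of this pipeline can be sketched as a plain aggregate-and-attach job. The sketch below mimics the buddy-count step (in the spirit of cal_buddy_cnt.cpp) in Python; the field names and sample rows are illustrative assumptions, not the real file schema:

```python
# Aggregate pt_buddy.csv-style rows into a per-blogger buddy count,
# then attach the count to each blogger record.
# Field names and data are illustrative, not the actual schema.
import csv
import io
from collections import Counter

# One row per (blogger, buddy) relation, as a small in-memory CSV.
buddy_csv = io.StringIO("blogger_id,buddy_id\nu1,u2\nu1,u3\nu2,u1\n")

# Count how many buddies each blogger has.
buddy_cnt = Counter(row["blogger_id"] for row in csv.DictReader(buddy_csv))

# Attach the count to each blogger record (0 if none).
bloggers = [{"blogger_id": "u1"}, {"blogger_id": "u2"}, {"blogger_id": "u3"}]
for b in bloggers:
    b["buddy_cnt"] = buddy_cnt.get(b["blogger_id"], 0)
```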
8.
New Changes surrounding Data Storages
• Volume
  • Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes, and often petabytes, in recent years
• Scale Out
  • Relational databases are hard to scale
  • Partitioning (for scalability): “relations” get broken
  • Replication (for availability)
• Speed
  • The seek times of physical storage are not keeping pace with improvements in network speeds
• Integration: “new relations”
  • Today’s data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network
10.
Best Practice in Hadoop
• Software Stack in Google/Hadoop
• Cookbook for “Big Data”
• Structured Data Storage for “Big Data”
[Diagram: the Bigtable data model. A row key and a column key (with columns grouped into column families) identify a cell, and each cell holds multiple versions distinguished by timestamp.]
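The structured storage sketched above can be modeled as a nested map from (row key, column family:qualifier, timestamp) to value. A minimal in-memory illustration, not the real HBase/Bigtable API:

```python
# Minimal in-memory sketch of the Bigtable/HBase data model:
# (row key, "family:qualifier", timestamp) -> value.
# This illustrates the model only; it is not the HBase client API.
from collections import defaultdict

class MiniTable:
    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, ts, value):
        self.rows[row][f"{family}:{qualifier}"][ts] = value

    def get(self, row, family, qualifier, ts=None):
        cell = self.rows[row][f"{family}:{qualifier}"]
        if not cell:
            return None
        ts = ts if ts is not None else max(cell)  # newest version by default
        return cell.get(ts)

t = MiniTable()
t.put("com.cnn.www", "contents", "html", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents", "html", 2, "<html>v2</html>")
latest = t.get("com.cnn.www", "contents", "html")  # newest timestamp wins
```

Note how different rows may carry different column qualifiers, which is exactly the “different tuples, different schemas” point debated later in the deck.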
14.
Case Study: Further Study in Parallel Join
Problems
• Need to sort
• Move the partitioned data across the network
  • Due to shuffling, the whole dataset must be sent
• Skewed by popular keys
  • All records for a particular key are sent to the same reducer
• Overhead by tagging
Alternatives
• Map-side Join
  • Mapper-only job to avoid the sort and to reduce data movement across the network
• Semi-Join
  • Shrink the data size through a semi-join (by preprocessing)
15. http://www.coordguru.com
Case Study: Improvements in Parallel Join
Map-Side Join
• Replicate the relatively smaller input source to the cluster
• Put the replicated dataset into a local hash table
• Join – the relatively larger input source with each local hash table
• Mapper: do the map-side join
Semi-Join
• Extract – the unique IDs referenced in the larger input source (A)
• Mapper: extract movie IDs from ratings records
• Reducer: accumulate all unique movie IDs
• Filter – the other larger input source (B) with the referenced unique IDs
• Mapper: filter the referenced movie IDs from the full movie dataset
• Join – the larger input source (A) with the filtered dataset
• Mapper: do the map-side join over ratings records and the filtered movie-ID dataset
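The two techniques above can be sketched in a few lines. This is an illustrative single-process sketch, not Hadoop code; the movies/ratings data is made up to mirror the slide's example.

```python
# Map-side join: the smaller input (movies) is replicated and loaded into an
# in-memory hash table, so each "mapper" can join without a shuffle/sort.
movies = {1: "Alien", 2: "Heat", 3: "Up"}       # smaller input, replicated
ratings = [(1, 5), (3, 4), (1, 3), (9, 2)]      # larger input: (movie_id, stars)

joined = [(mid, movies[mid], stars)
          for mid, stars in ratings if mid in movies]

# Semi-join: first extract the unique IDs actually referenced by the large
# input, then filter the other input down to just those IDs before joining.
referenced_ids = {mid for mid, _ in ratings}    # extract phase
filtered_movies = {mid: name for mid, name in movies.items()
                   if mid in referenced_ids}    # filter phase
print(joined)
print(filtered_movies)
```

In Hadoop terms, the hash-table load happens in each mapper's setup (e.g. from the distributed cache), and the extract/filter phases of the semi-join are separate preprocessing jobs.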
MapReduce is just A Major Step Backwards!!!
DeWitt and Stonebraker, January 17, 2008
• A giant step backward in the programming paradigm for large-scale data-intensive applications
• Schemas are good
• Types are checked at load time, so no garbage gets in
• Separation of the schema from the application is good
• The schema is stored in catalogs, so it can be queried (in SQL)
• High-level access languages are good
• Present what you want rather than an algorithm for how to get it
• No schema?!
• MapReduce assumes at least one data field, by specifying the key as input
• In Bigtable/HBase, different tuples within the same table can actually have different schemas
• There is not even support for logical schema changes such as views
MapReduce is just A Major Step Backwards!!! (cont'd)
DeWitt and Stonebraker, January 17, 2008
• A sub-optimal implementation, in that it uses brute force instead of indexing
• Indexing
• All modern DBMSs use hash or B-tree indexes to accelerate access to data
• In addition, a query optimizer decides whether to use an index or perform a brute-force sequential scan
• MapReduce has no indexes, so it processes data only in brute-force fashion
• Automatic parallel execution
• In the 1980s the DBMS research community explored this in systems such as Gamma, Bubba, and Grace, and even the commercial Teradata
• Skew
• When the distribution of records with the same key is skewed in the map phase, some reducers take much longer than others
• Intermediate data pulling
• In the reduce phase, two or more reducers may attempt to read input files from the same map node simultaneously
MapReduce is just A Major Step Backwards!!! (cont'd)
DeWitt and Stonebraker, January 17, 2008
• Not novel at all – it represents a specific implementation of well-known techniques developed nearly 25 years ago
• Partitioning for joins
• Application of Hash to Data Base Machine and its Architecture, 1983
• Joins in parallel on shared-nothing architectures
• Multiprocessor Hash-based Join Algorithms, 1985
• The Case for Shared-Nothing, 1986
• Aggregates in parallel
• The Gamma Database Machine Project, 1990
• Parallel Database Systems: The Future of High Performance Database Systems, 1992
• Adaptive Parallel Aggregation Algorithms, 1995
• Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years
• PostgreSQL supported user-defined functions and user-defined aggregates in the mid-1980s
MapReduce is just A Major Step Backwards!!! (cont'd)
DeWitt and Stonebraker, January 17, 2008
• Missing most of the features that are routinely included in current DBMSs
• MapReduce provides only a sliver of the functionality found in modern DBMSs
• Bulk loader – transforms input data in files into a desired format and loads it into a DBMS
• Indexing – hash or B-tree indexes
• Updates – change the data in the database
• Transactions – support parallel updates and recovery from failures during updates
• Integrity constraints – help keep garbage out of the database
• Referential integrity – again, helps keep garbage out of the database
• Views – so the schema can change without having to rewrite the application program
• Incompatible with all of the tools DBMS users have come to depend on
• MapReduce cannot use the tools available in a modern SQL DBMS, and has none of its own
• Report writers (Crystal Reports)
• Prepare reports for human visualization
• Business intelligence tools (Business Objects or Cognos)
• Enable ad-hoc querying of large data warehouses
• Data mining tools (Oracle Data Mining or IBM DB2 Intelligent Miner)
• Allow a user to discover structure in large data sets
• Replication tools (GoldenGate)
• Allow a user to replicate data from one DBMS to another
• Database design tools (Embarcadero)
• Assist the user in constructing a database
RDB Experts Jump the MR Shark
Greg Jorgensen, January 17, 2008
• Arg 1: MapReduce is a step backwards in database access
• MapReduce is not a database, a data storage system, or a data management system
• MapReduce is an algorithmic technique for the distributed processing of large amounts of data
• Arg 2: MapReduce is a poor implementation
• MapReduce is one way to generate indexes from a large volume of data, but it's not a data storage and retrieval system
• Arg 3: MapReduce is not novel
• Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world – but so what?
• The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers
• Arg 4: MapReduce is missing features
• Arg 5: MapReduce is incompatible with the DBMS tools
• The ability to process a huge volume of data quickly, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness
DBs are hammers; MR is a screwdriver
Mark C. Chu-Carroll
• RDBs don't parallelize very well
• How many RDBs do you know that can efficiently split a task among 1,000 cheap computers?
• RDBs don't handle non-tabular data well
• RDBs are notorious for doing a poor job on recursive data structures
• MapReduce isn't intended to replace relational databases
• It's intended to provide a lightweight way of programming things so that they can run fast, by running in parallel on a lot of machines
MR is a Step Backwards, but some Steps Forward
Eugene Shekita
• Arg 1: Data models, schemas, and query languages
• Semi-structured data models and high-level parallel data-flow query languages are built on top of MapReduce
• Pig, Hive, Jaql, Cascading, CloudBase
• Hadoop will eventually have a real data model, schema, catalogs, and query language
• Moreover, Pig, Jaql, and Cascading are some steps forward
• They support semi-structured data
• They support high-level parallel data-flow languages rather than declarative query languages
• Greenplum and Aster Data support MapReduce, but look more limited than Pig, Jaql, and Cascading
• Calls to MapReduce functions wrapped in SQL queries make it difficult to work with semi-structured data and to program multi-step dataflows
• Arg 3: Novelty
• Teradata was doing parallel group-by 20 years ago
• UDAs and UDFs appeared in PostgreSQL in the mid-80s
• And yet, MapReduce is much more flexible and fault-tolerant
• It supports semi-structured data types and customizable partitioning
Challenges in Traditional RDBMS
• Volume
• Data volumes have grown from tens of gigabytes in the 1990s to hundreds of terabytes and often petabytes in recent years
• Speed
• The seek times of physical storage are not keeping pace with improvements in network speeds
→ "New Relations"
Challenges in Traditional RDBMS (cont'd)
• Scale Out
• Is it possible to achieve a large number of simple read/write operations per second?
• Traditional RDBMSs have not provided good horizontal scaling for OLTP
• Partitioning (for scalability): "relations" get broken
• Replication (for availability)
• Data warehousing RDBMSs provide horizontal scaling of complex joins and queries
• Most of them are read-only or read-mostly
• Integration
• Today's data processing tasks increasingly have to access and combine data from many different non-relational sources, often over a network
The New Faces of Data
• Scale out
• CAP Theorem
• The CAP theorem states that any distributed data system can achieve only two of consistency, availability, and partition tolerance at any given time
• Hence when building distributed systems, just pick 2 of 3
• Design issues
• ACID: Atomicity, Consistency, Isolation, Durability
• BASE: Basically Available, Soft-state, Eventual consistency
The New Faces of Data (cont'd)
• Sparsity
• Some data have sparse attributes ("schema-free")
• document-term vectors
• user-item matrices
• semantic or social relations
• Some data do not need the 'relational' property or complex join queries
• log-structured data
• stacked or streamed data
• e.g. Facebook, Server Density (MySQL -> MongoDB)
• Immutability
• No need to update or delete data; only insert it with versions
• tracking history
• lock-free
• atomicity is based on just a key
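A minimal sketch of this insert-only, versioned pattern (illustrative only, not any particular product's API; the class and key names are invented): updates append a new version under the key instead of mutating in place, so history is retained and per-key reads need no locks.

```python
# Append-only versioned key-value store: "updates" are inserts with a version.
from collections import defaultdict

class VersionedStore:
    def __init__(self):
        self.log = defaultdict(list)   # key -> [(version, value), ...]

    def put(self, key, value):
        versions = self.log[key]
        versions.append((len(versions) + 1, value))  # append, never overwrite

    def get(self, key, version=None):
        versions = self.log[key]
        if not versions:
            return None
        if version is None:
            return versions[-1][1]     # latest version by default
        return dict(versions).get(version)

s = VersionedStore()
s.put("profile", {"name": "Ann"})
s.put("profile", {"name": "Ann", "city": "Seoul"})
print(s.get("profile"))      # latest state
print(s.get("profile", 1))   # full history remains queryable
```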
Key Features of Non-Relational Databases
• Common features
• A call-level interface (in contrast to a SQL binding)
• HTTP/REST or easy-to-program APIs
• Fast indexes on large amounts of data
• Lookups by one or more keys (key-value or document)
• Ability to horizontally scale throughput over many servers
• Automatic sharding or client-side manual sharding
• Built-in replication (sync or async)
• Eventual consistency
• Ability to dynamically define attributes or the data schema
• Key-value, column, or document
• Support for MapReduce
Data Models of Non-Relational Databases
• Data models
• Tuple
• A set of attribute-value pairs
• Attribute names are defined in a schema
• Values must be scalar (like numbers and strings), not BLOBs
• Values are referenced by attribute name, not by ordinal position
• Document
• A set of attribute-value pairs
• Attribute names are dynamically defined for each document at runtime
• Unlike tuples, there is no global schema for attributes
• Values may be complex or nested values
• Multiple indexes are supported
• Extensible Record
• A hybrid between tuple and document
• Families of attributes are defined in a schema
• New attributes can be defined (within an attribute family) on a per-record basis
• Object
• A set of attribute-value pairs
• Values may be complex values or pointers to other objects
Classes of Non-Relational Databases
• Classification by data model
• Key-value stores
• Store values and an index to find them
• Provide replication, versioning, locking, transactions, sorting, etc.
• Document stores
• Store indexed documents (with multiple indexes)
• Do not support locking, synchronous replication, or ACID transactions
• Instead of ACID, support BASE for much higher performance and scalability
• Provide some simple query mechanisms
• Extensible record stores (= column-oriented stores)
• Store extensible records that can be horizontally and vertically partitioned across nodes
• Both rows and columns are split over multiple nodes
• Rows are split across nodes by range partitioning
• Columns of a table are distributed over multiple nodes by using "column groups"
• Relational databases
• Store, index, and query tuples
• Some new RDBMSs provide horizontal scaling
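The splitting scheme above (rows by range partitioning, columns by column groups) can be sketched as a simple routing function. The node names, key ranges, and group names below are invented for illustration; real extensible record stores manage these assignments automatically.

```python
# Route a (row key, column) pair to the nodes that own its row range and
# its column group, as an extensible record store might.
row_ranges = [("a", "m", "node1"),   # [start, end) of row keys -> node
              ("m", None, "node2")]  # None = unbounded upper end
column_groups = {"profile": "node-cf1", "activity": "node-cf2"}

def route(row_key, column):
    # Range partitioning over row keys; "column groups" over column families.
    row_node = next(n for lo, hi, n in row_ranges
                    if lo <= row_key and (hi is None or row_key < hi))
    col_node = column_groups[column.split(":")[0]]
    return row_node, col_node

print(route("alice", "profile:name"))
print(route("zoe", "activity:last_login"))
```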
A Comparison of Non-Relational Databases
Per project (implementation language): replication; partitioning; persistence; consistency & transactions; client protocol; data model; docs; community
• Bigtable (C++): Sync (GFS); Range; Memtable/SSTable on GFS; lock + limited ACID transactions; custom API; Column; A; Google, no
• HBase (Java): Sync (HDFS); Range; Memtable/SSTable on HDFS; lock + limited ACID transactions; custom API, Thrift, REST; Column; A; Apache, yes
• Hypertable (C++): Sync (FS); Range; CellCache/CellStore on any FS; lock + limited ACID transactions; Thrift, other; Column; A; Zvents, Baidu, yes
• Cassandra (Java): Async; Hash; on-disk; MVCC + limited ACID transactions; Thrift; Column & Key-Value; B; Facebook, no
• Coord (C++): Sync (on client side); Hash (on client side); pluggable: in-memory, Lucene; no; custom API (Python, PHP, Java, C++); Key-Value or Document (JSON); A; NHN, yes
• Dynamo (?): Yes; Yes; ?; ?; custom API; Key-Value; A; Amazon, no
• Voldemort (Java): Async; Hash; pluggable: BerkeleyDB, MySQL; MVCC; Java API; Key-Value (blob/text); A; LinkedIn, no
• Redis (C): Sync; Hash (on client side); in-memory with background snapshots; lock; custom API (collection); Key-Value; C; some
• Tokyo Tyrant (C): Async; manual sharding; in-memory or on-disk (hash, B-tree, fixed-size/variable-length record tables); lock + limited ACID transactions; ?; Key-Value; C; ?
• Scalaris (Erlang): Sync; Range; only in-memory; lock + limited ACID transactions; Erlang, Java, HTTP; Key-Value (blob); B; OnScale, no
• Kai (Erlang): ?; Yes; on-disk Dets file; ?; Memcached; Key-Value (blob); C; no
• Dynomite (Erlang): Yes; Yes; pluggable: couch, dets; ?; custom ASCII, Thrift; Key-Value (blob); D+; Powerset, no
• MemcacheDB (C): Yes; No; BerkeleyDB; ?; Memcached; Key-Value (blob); B; some
• Riak (Erlang): Async; Hash; pluggable: in-memory, ets, dets, osmos tables (no indices on 2nd key fields); MVCC; REST (JSON-based); Key-Value & Document; B; no
• SimpleDB (?): Async; no automated sharding; S3; no; custom API; Document; B; Amazon, no
• ThruDB (C++): Yes; No; pluggable: BerkeleyDB, Custom, MySQL, S3; ?; Thrift; Document; C+; Third Rail, unsure
• CouchDB (Erlang): Async; no automated sharding; on-disk with append-only B-tree; MVCC; HTTP, JSON, custom API (map/reduce views); Document (JSON); A; Apache, yes
• MongoDB (C++): Async; sharding new; on-disk with B-tree; field-level; HTTP, BSON, custom API (cursor); Document (BSON); A; 10gen, yes
• Neo4j (?): ?; ?; on-disk linked lists; ?; custom API (graph); Graph; ?; ?
On-going classification by Woohyun Kim
Document-oriented vs. RDBMS
• Terminology: CouchDB – Document, Field, Database; MongoDB – Document, Key, Collection; MySQL – ?
• Data model: CouchDB – document-oriented (JSON); MongoDB – document-oriented (BSON); MySQL – relational
• Data types: CouchDB – text, numeric, boolean, and list; MongoDB – string, int, double, boolean, date, bytearray, object, array, others; MySQL – link
• Large objects (files): CouchDB – yes (attachments); MongoDB – yes (GridFS); MySQL – no (?)
• Replication: CouchDB – master-master (with developer-supplied conflict resolution); MongoDB – master-slave; MySQL – master-slave
• Object (row) storage: CouchDB – one large repository; MongoDB – collection-based; MySQL – table-based
• Query method: CouchDB – map/reduce of JavaScript functions to lazily build an index per query; MongoDB – dynamic, object-based query language; MySQL – dynamic, SQL
• Secondary indexes: yes in all three
• Atomicity: CouchDB – single document; MongoDB – single document; MySQL – yes, advanced
• Interface: CouchDB – REST; MongoDB – native drivers; MySQL – native drivers
• Server-side batch data manipulation: CouchDB – yes, via JavaScript (through map/reduce views); MongoDB – yes, via JavaScript; MySQL – yes (SQL)
• Written in: CouchDB – Erlang; MongoDB – C++; MySQL – C
• Concurrency control: CouchDB – MVCC; MongoDB – update in place; MySQL – update in place
Appendix: What is Coord?
Architectural Comparison
• dust: a distributed file system based on DHT
• coord spaces: a resource-sharable store system based on SBA
• coord mapreduce: a simplified large-scale data processing framework
• warp: a scalable remote/parallel execution system
• graph: a large-scale distributed graph search system
Appendix: Coord Internals
A space-based architecture built on distributed hash tables
• SBA (space-based architecture): processes communicate with each other only through spaces
• DHT (distributed hash tables): data identified by hash functions are placed on numerically near nodes
• A computing platform that projects a single address space onto distributed memories
• As if users worked in a single computing environment
(Diagram: an application issues take/write/read operations against a space that is mapped onto a hash ring of positions 0 to 2^m-1, spread across nodes 1..n)
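The DHT placement rule described above (keys placed on numerically near nodes) can be sketched generically. This is an illustrative consistent-hashing sketch, not Coord's actual implementation; the node names and ring size are invented.

```python
# Both nodes and keys are hashed onto a ring of 2^M positions; each key is
# stored on the node whose ring position is numerically nearest.
import hashlib

M = 16                                   # ring spans positions 0 .. 2^M - 1

def ring_pos(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

nodes = {n: ring_pos(n) for n in ["node1", "node2", "node3"]}

def place(key):
    k = ring_pos(key)
    # Numerical distance, accounting for wrap-around on the ring.
    dist = lambda p: min((p - k) % 2 ** M, (k - p) % 2 ** M)
    return min(nodes, key=lambda n: dist(nodes[n]))

print(place("user:42"), "stores key user:42")
```

Because placement depends only on the hashes, any client can compute a key's home node locally, which is what makes client-side partitioning schemes like this work without a central directory.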