This document provides an overview of big data analysis using Hadoop and Pig. It introduces Dušan Zamurović and his work with big data reporting and quality assurance. It then discusses MapReduce and Hadoop for processing large datasets, and the Pig query language for analyzing data stored in Hadoop. Examples are provided of social network interaction analysis using Pig, including loading and parsing JSON event logs, grouping and summarizing interactions, and calculating average interaction weights.
2.
Name: Dušan Zamurović
Where do I come from?
◦ codecentric Novi Sad
What do I do?
◦ Java web-app background
◦ ♥ JavaScript ♥
Ajax with DWR lib
◦ Android
◦ currently Big Data (reporting QA)
5. A revolution that will transform how we live, work,
and think.
3 Vs of big data
◦ Volume
◦ Variety
◦ Velocity
Everyday use cases
◦ Beautiful
◦ Useful
◦ Funny
6.
The principal characteristic
Studies report
◦ 1.2 trillion gigabytes of new data was created
worldwide in 2011 alone
◦ From 2005 to 2020, the digital universe will grow
by a factor of 300
◦ By 2020 the digital universe will amount to 40
trillion gigabytes (more than 5,200 gigabytes for
every man, woman, and child in 2020)
7.
The biggest growth – unstructured data
◦ Documents
◦ Web logs
◦ Sensor data
◦ Videos and photos
◦ Medical devices
◦ Social media
>90% of this Big Data is unstructured
Analytic value?
◦ 33% valuable info by 2020
8.
Generated at high speed
Needs real-time processing
Example I
◦ Financial world
◦ Thousands or millions of transactions
Example II
◦ Retail
◦ Analyze click streams to offer recommendations
9. Value of Big Data is potentially great but can be
released only with the right combination of
people, processes and technologies.
…unlock significant value by making
information transparent and usable at much
higher frequency
10.
Measuring heartbeat of a city - Rio de Janeiro
More examples
◦ Product development – most valuable features
◦ Manufacturing – indicators of quality problems
◦ Distribution – optimize inventory and supply chains
◦ Sales – account targeting, resource allocation
Beer and diapers
Possible issues?
◦ Privacy, security, intellectual property, liability…
11. "Map/Reduce is a programming model and an
associated implementation for processing and
generating large data sets. Users specify a map
function that processes a key/value pair to
generate a set of intermediate key/value pairs,
and a reduce function that merges all
intermediate values associated with the same
intermediate key."
- research publication
http://research.google.com/archive/mapreduce.html
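To make the model concrete, here is a minimal word-count sketch in Java against Hadoop's org.apache.hadoop.mapreduce API. The task and the class names are illustrative, not taken from the talk: the map function emits a (word, 1) pair for every token, and the reduce function sums the values that arrive under the same key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (offset, line) -> (word, 1) for every token in the line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}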
12.
13.
In the beginning, there was Nutch
Which problems does it address?
◦ Big Data
◦ Not fit for RDBMS
◦ Computationally extensive
Hadoop && RDBMS
◦ “Get data to process” or “send code where data is”
◦ Designed to run on a large number of machines
◦ Separate storage
14.
Distributed File System
◦ Designed for commodity hardware
◦ Highly fault-tolerant
◦ Relaxed POSIX
To enable streaming access to file system data
Assumptions and Goals
◦ Hardware failure
◦ Streaming data access
◦ Large data sets
◦ Write-once-read-many
◦ Move computation, not data
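The streaming, write-once-read-many access model above can be illustrated with the Java FileSystem API; this is a minimal sketch, and the path is only an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // the default (distributed) file system
    Path path = new Path("/logs/events.log");        // example path, not from the talk
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {   // stream the file line by line
        System.out.println(line);
      }
    }
  }
}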
15.
NameNode
◦ Master server, central component
◦ HDFS cluster has single NameNode
◦ Manages client’s access
◦ Keeps track where data is kept
◦ Single point of failure
Secondary NameNode
◦ Optional component
◦ Checkpoints of the namespace
Does not provide any real redundancy
16.
DataNode
◦ Stores data in the file system
◦ Talks to NameNode and responds to requests
◦ Talks to other DataNodes
Data replication
TaskTracker
◦ Should be where DataNode is
◦ Accepts tasks (Map, Reduce, Shuffle…)
◦ Set of slots for tasks
◦ ♥__ ♥__ ♥__ ________ ♥_ ♥ ♥ ♥__________________
17.
JobTracker
◦ Farms tasks to specific nodes in the cluster
◦ Point of failure for MapReduce
How it goes?
1. Client submits a job to the JobTracker
2. JobTracker asks the NameNode where the data is
3. JobTracker locates TaskTracker nodes
4. JobTracker submits the tasks to the chosen TaskTrackers
5. TaskTracker ♥__ ♥__ ♥__
◦ Job failed: TaskTracker informs the JobTracker, which decides how to proceed
◦ Job done: JobTracker updates the status
6. Client can poll JobTracker for information
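As a sketch of the client side of this flow, and assuming the WordCount classes from the earlier example, a job is configured and submitted roughly like this (input and output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");          // the job the client submits
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/books
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
    // submit the job and poll for status until it finishes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}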
18.
Platform for analyzing large data sets
◦ Language – Pig Latin
◦ High level approach
◦ Compiler
◦ Grunt shell
Pig compared to SQL
◦ Lazy evaluation
◦ Procedural language
◦ More like an execution plan
19.
Pig Latin statements
◦ A relation is a bag
◦ A bag is a collection of tuples
◦ A tuple is an ordered set of fields
◦ A field is a piece of data
◦ A relation is referenced by name, i.e. alias
A = LOAD 'student' USING PigStorage() AS
(name:chararray, age:int, gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
20.
Data types
◦ Simple
int – signed 32-bit integer
long – signed 64-bit integer
float – 32-bit floating point
double – 64-bit floating point
chararray – UTF-8 string
bytearray – blob
boolean – since Pig 0.10
datetime
◦ Complex
tuple – an ordered set of fields, e.g. (21,32)
bag – a collection of tuples, e.g. {(21,32),(32,43)}
map – a set of key-value pairs, e.g. [pig#latin]
21.
Data structure and defining schemas
◦ Why define a schema?
◦ Where to define a schema?
◦ How to define a schema?
/* data types not specified */
a = LOAD '1.txt' AS (a0, b0);
a: {a0: bytearray,b0: bytearray}
/* number of fields not known */
a = LOAD '1.txt';
a: Schema for a unknown
24.
User Defined Functions
◦ Java, Python, JavaScript, Ruby, Groovy
How to write a UDF?
◦ Eval function extends EvalFunc<something>
◦ Load function extends LoadFunc
◦ Store function extends StoreFunc
How to use a UDF?
◦ Register
◦ Define the name of the UDF if you like
◦ Call it
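A minimal eval UDF in Java might look like the sketch below; the class name and behaviour are illustrative (the canonical upper-case example), not code from the talk:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Turns its first argument into upper case; returns null for empty input.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In the Grunt shell it would then be registered with REGISTER myudfs.jar; (the jar name is an example) and called as com.example.Upper(name) inside a FOREACH … GENERATE statement, optionally behind a shorter alias created with DEFINE.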
25.
26.
Imaginary social network
A lot of users…
… with their friends, girlfriends, boyfriends, wives,
husbands, mistresses, etc…
New relationship arises…
◦ … but new friend is not shown in news feed
Where are his/her activities?
◦ Hidden, marked as not important
27.
Find out the value of the relationship
Monitor and log user activities
◦ For each user, of course
◦ Each activity has some value (event weight)
◦ Records user’s activities
◦ Store those logs in HDFS
◦ Analyze those logs from time to time
◦ Calculate needed values
◦ Show only the activities of “important” friends
28.
Events recorded in JSON format
{
"timestamp": 1341161607860,
"sourceUser": "marry.lee",
"targetUser": "ruby.blue",
"eventName": "VIEW_PHOTO",
"eventWeight": 1
}
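Purely for illustration (this is not necessarily how the JsonLoader UDF used in the next slides works), such an event can be parsed in Java with Jackson:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EventParser {
  public static void main(String[] args) throws Exception {
    String json = "{\"timestamp\": 1341161607860, \"sourceUser\": \"marry.lee\","
        + " \"targetUser\": \"ruby.blue\", \"eventName\": \"VIEW_PHOTO\", \"eventWeight\": 1}";

    JsonNode event = new ObjectMapper().readTree(json);   // parse one event log line
    String source = event.get("sourceUser").asText();
    String target = event.get("targetUser").asText();
    int weight    = event.get("eventWeight").asInt();

    System.out.println(source + " -> " + target + " (weight " + weight + ")");
  }
}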
38. REGISTER codingserbia-udf.jar
DEFINE AVG_WEIGHT com.codingserbia.udf.AverageWeight();
interactionRecords = LOAD '/blog/user_interaction_big.json'
USING com.codingserbia.udf.JsonLoader();
interactionData = FOREACH interactionRecords GENERATE
sourceUser,
targetUser,
eventWeight;
groupInteraction = GROUP interactionData BY (sourceUser,
targetUser);
…
39. …
summarizedInteraction = FOREACH groupInteraction GENERATE
group.sourceUser AS sourceUser,
group.targetUser AS targetUser,
SUM(interactionData.eventWeight) AS eventWeight,
COUNT(interactionData.eventWeight) AS eventCount,
AVG_WEIGHT(
SUM(interactionData.eventWeight),
COUNT(interactionData.eventWeight)) AS averageWeight;
result = ORDER summarizedInteraction BY
sourceUser, eventWeight DESC;
STORE result INTO '/results/pig_mr' USING PigStorage();
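The AVG_WEIGHT function registered at the top of the script is the custom AverageWeight UDF from codingserbia-udf.jar. Its source is not shown in the slides; a plausible sketch, assuming it simply divides the summed weight by the event count, could look like this:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical reconstruction: averageWeight = SUM(eventWeight) / COUNT(eventWeight)
public class AverageWeight extends EvalFunc<Double> {
  @Override
  public Double exec(Tuple input) throws IOException {
    if (input == null || input.size() < 2 || input.get(0) == null || input.get(1) == null) {
      return null;
    }
    double sum = ((Number) input.get(0)).doubleValue();   // SUM(interactionData.eventWeight)
    long count = ((Number) input.get(1)).longValue();     // COUNT(interactionData.eventWeight)
    return count == 0 ? null : sum / count;
  }
}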
Editor's Notes
Big data. One of the buzz words of the software industry in the last decade. We all heard about it, but I am not sure we can actually comprehend it as we should and as it deserves. It reminds me of the Universe – mankind knows that it is big, huge, vast, but no one can really understand its size. The same can be said for the amount of data being collected and processed every day somewhere in the clouds of IT. As Google’s CEO, Eric Schmidt, once said: “There were 5 exabytes of information created by the entire world between the dawn of civilization and 2003. Now that same amount is created every two days.”
Almost every organization has to deal with huge amounts of data. Much of this exists in conventional structured forms, stored in relational databases. However, the biggest growth comes from unstructured data, both from inside and outside the enterprise - including documents, web logs, sensor data, videos, medical devices and social media. According to some studies, more than 90% of Big Data is unstructured data. The majority of information in the digital universe, 68% in 2012, is created and consumed by consumers watching digital TV, interacting with social media, sending camera phone images and videos between devices and around the Internet, and so on. But only a fraction of it is explored for analytic value. Some studies say that only 33% of the digital universe will contain valuable info by 2020.
As well as volume and variety, Big Data is often said to exhibit "velocity" - meaning that the data is being generated at high speed, and needs real-time processing and analysis. One example of the need for real-time processing of Big Data is in the financial world, where thousands or millions of transactions must be continuously analyzed for possible fraud in a matter of seconds. Another example is in retail, where a business may be analyzing many customer click-streams and purchases to generate real-time intelligent recommendations.
As organizations create and store more data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Companies are using data collection and analysis to conduct controlled experiments to make better management decisions.
Measuring the heartbeat of a city - Rio de Janeiro: 6.5M people, 8M vehicles, 4M bus passengers, 44k police... tropical monsoon climate. Big Data is used to monitor weather, traffic (GPS-tracked buses and medical vehicles), police, emergency services - using analytics to predict problems before they occur.
Not such a beautiful example, but Big Data also influences business and decision making:
- Product Development: incorporate the features that matter most
- Manufacturing: flag potential indicators of quality problems
- Distribution: quantify optimal inventory and supply chain activities
- Marketing: identify your most effective campaigns for engagement and sales
- Sales: optimize account targeting, resource allocation, revenue forecasting
Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also structure workflows and incentives to optimize the use of big data. Access to data is critical—companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms. Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built. Those are just a few examples.
Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization’s data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because there are multiple copies stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you’ve got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That’s MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
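The write-once-read-many model shows up directly in the client API: a file is created and written through a single output stream, closed, and from then on any number of clients can stream it back sequentially. The following sketch uses the org.apache.hadoop.fs.FileSystem API with a hypothetical path and record contents; it assumes the usual core-site.xml/hdfs-site.xml configuration is available on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/events/log-00001.txt");  // hypothetical path

        // Write once: create the file, stream records into it, close it.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("event-1\n");
            out.writeBytes("event-2\n");
        }

        // Read many: clients stream the closed file sequentially, no random updates.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}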
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system: when the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
A DataNode stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up, and then responds to requests from the NameNode for filesystem operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. DataNode instances can also talk to each other, which is what they do when they are replicating data. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

The TaskTracker spawns separate JVM processes to do the actual work; this ensures that a process failure does not take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. A typical job flow looks like this:
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
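From the client's point of view, this submit-and-poll cycle can be sketched with the org.apache.hadoop.mapreduce.Job API. The sketch below reuses the hypothetical WordCount classes from the earlier example and takes input/output paths from the command line; it illustrates the polling pattern only, not a prescribed client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitAndPoll {
    public static void main(String[] args) throws Exception {
        // Configure the job exactly as in the word-count sketch above (hypothetical classes)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.submit();                       // hand the job over to the cluster and return immediately
        while (!job.isComplete()) {         // poll for progress
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
    }
}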
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Compared with SQL, Pig offers:
- Lazy evaluation
- The ability to store data at any point in the pipeline
- A procedural language, more like an execution plan, which offers more control over the flow of data processing
- A choice of join implementation: SQL offers an option to join two tables, but Pig also lets you choose how that join is executed (see the sketch after this list)
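As a small illustration of those last two points, the following sketch drives Pig from Java through the PigServer API, picks a specific join implementation (a fragment-replicate join via USING 'replicated'), and stores an intermediate result. The input files and aliases are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class JoinChoiceExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // local mode, just for the example

        // 'events.tsv' and 'users.tsv' are hypothetical inputs
        pig.registerQuery("events = LOAD 'events.tsv' AS (user_id:int, action:chararray);");
        pig.registerQuery("users  = LOAD 'users.tsv'  AS (user_id:int, country:chararray);");

        // Pig lets you choose the join implementation: here the small 'users' relation
        // (listed last) is replicated to every map task instead of going through a reduce
        pig.registerQuery("joined = JOIN events BY user_id, users BY user_id USING 'replicated';");

        // Results can be stored at any point in the pipeline
        pig.store("joined", "joined_out");
    }
}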
A Pig relation is a bag of tuples. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Also note that relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Furthermore, processing may be parallelized, in which case tuples are not processed according to any total ordering.
Schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. Schemas are defined with the LOAD, STREAM, and FOREACH operators using the AS clause.
- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named) you can only refer to the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
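A small, illustrative sketch (again driving Pig through PigServer, with hypothetical input files) showing the three schema options and the name, positional, and cast notations described above:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SchemaExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Full schema: field names and types via the AS clause ('people.tsv' is hypothetical)
        pig.registerQuery("a = LOAD 'people.tsv' AS (name:chararray, age:int);");

        // Names only: every field type defaults to bytearray
        pig.registerQuery("b = LOAD 'people.tsv' AS (name, age);");

        // No schema: fields are un-named and can only be referenced positionally
        pig.registerQuery("c = LOAD 'people.tsv';");

        // Named fields can be referenced by name or by position ($0, $1, ...)
        pig.registerQuery("d = FOREACH a GENERATE name, $1;");

        // Un-typed fields default to bytearray and can be cast to concrete types later
        pig.registerQuery("e = FOREACH c GENERATE (chararray) $0, (int) $1;");

        pig.store("e", "schema_out");
    }
}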
Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in five languages: Java, Python, JavaScript, Ruby and Groovy.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported.

Limited support is provided for Python, JavaScript, Ruby and Groovy functions. These functions are new, still evolving, additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript, Ruby and Groovy are provided as experimental features because they did not go through the same amount of testing as Java or Python. Note that at runtime Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar (Jython, Rhino, JRuby or Groovy-all) to the backend.

Pig also provides support for Piggy Bank, a repository for Java UDFs. Through Piggy Bank you can access Java UDFs written by other users and contribute Java UDFs that you have written.

Eval is the most common type of function. It can be used in FOREACH statements for whatever purpose. Its core method has the signature:

public String exec(Tuple input)
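As a concrete illustration, here is a minimal sketch of an eval function: a hypothetical UpperCase UDF that upper-cases a chararray field. The class and file names are illustrative, not part of Pig itself.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases the first field of the input tuple
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                 // usual Pig convention: return null for empty input
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once compiled into a jar, such a function would typically be registered and used from a Pig script along the lines of REGISTER myudfs.jar; followed by B = FOREACH A GENERATE UpperCase(name); (the jar and alias names are again illustrative).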
The load/store UDFs control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output, but that does not have to be the case. The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes.

The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below:
- getInputFormat(): This method is called by Pig to get the InputFormat used by the loader. The methods in the InputFormat (and underlying RecordReader) are called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the InputFormat is a Hadoop-packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based or file-based InputFormat would like to read files in all subdirectories under a given input directory recursively, it should use the PigTextInputFormat and PigFileInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. The Pig InputFormat classes work around a current limitation in the Hadoop TextInputFormat and FileInputFormat classes, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirs 'dir2' and 'dir2/dir3' beneath dir1, the Hadoop TextInputFormat and FileInputFormat classes read the files under 'dir1' only. Using PigTextInputFormat or PigFileInputFormat (or by extending them), the files in all the directories can be read.
- setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. This method is called multiple times by Pig; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls.
- prepareToRead(): Through this method the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to Pig.
- getNext(): The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data. In this method the implementation should use the underlying RecordReader and construct the tuple to return.

The StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it. The methods which need to be overridden in StoreFunc are explained below (a sketch follows this list):
- getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a MapReduce Java program. If the OutputFormat is a Hadoop-packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that it can be called multiple times without inconsistent side effects.
- setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by Pig; implementations should ensure there are no inconsistent side effects due to the multiple calls.
- prepareToWrite(): In the new API, writing of the data is done through the OutputFormat provided by the StoreFunc. In prepareToWrite() the RecordWriter associated with the OutputFormat provided by the StoreFunc is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the RecordWriter.
- putNext(): The meaning of putNext() has not changed; it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying RecordWriter to write the Tuple out.
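To tie the StoreFunc methods together, here is a minimal, illustrative storer that delegates to Hadoop's TextOutputFormat and writes each tuple as a tab-separated line. The class name is hypothetical, and a loader (LoadFunc) would mirror the same pattern on the read side with getInputFormat(), setLocation(), prepareToRead() and getNext().

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

// Hypothetical storer: one tab-delimited text line per tuple
public class SimpleTextStorer extends StoreFunc {
    private RecordWriter writer;   // provided via prepareToWrite()

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Delegate the actual writing to a standard Hadoop OutputFormat
        return new TextOutputFormat<NullWritable, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // Pass the store location from the Pig script through to the OutputFormat;
        // this may be called multiple times, so it must be free of side effects
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void putNext(Tuple t) throws IOException {
        try {
            // Write each tuple as a tab-separated line of text
            writer.write(NullWritable.get(), new Text(t.toDelimitedString("\t")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}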