HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
Sep 10, 2009   Schubert Zhang (schubert.zhang@gmail.com)   http://cloudepr.blogspot.com
1. Introduction
HFile mimics Google's SSTable and is now available in Hadoop HBase-0.20.0. Previous
releases of HBase temporarily used an alternative file format, MapFile [4], a common
file format in the Hadoop IO package. I think HFile should also become a common file
format once it matures, and should be moved into Hadoop's common IO package in the
future.
The following description of SSTable is taken from Section 4 of Google's Bigtable paper.
The Google SSTable file format is used internally to store Bigtable data.
An SSTable provides a persistent, ordered immutable map from keys to values,
where both keys and values are arbitrary byte strings. Operations are
provided to look up the value associated with a specified key, and to
iterate over all key/value pairs in a specified key range. Internally,
each SSTable contains a sequence of blocks (typically each block is 64KB
in size, but this is configurable). A block index (stored at the end of
the SSTable) is used to locate blocks; the index is loaded into memory
when the SSTable is opened. A lookup can be performed with a single disk
seek: we first find the appropriate block by performing a binary search
in the in-memory index, and then reading the appropriate block from disk.
Optionally, an SSTable can be completely mapped into memory, which allows
us to perform lookups and scans without touching disk.[1]
HFile implements the same features as SSTable, though it may provide more or less.
2. File Format
Data Block Size
Whenever we say block size, we mean the uncompressed size.
The size of each data block is 64KB by default and is configurable in HFile.Writer.
A data block will not exceed this size by more than one key/value pair: HFile.Writer
starts a new data block once the current block's size reaches or exceeds this limit.
The 64KB default is the same as Google's [1].
To achieve better performance, we should select an appropriate block size. If the
average key/value size is very small (e.g. 100 bytes), we should select smaller blocks
(e.g. 16KB) to avoid packing too many key/value pairs into each block, which would
increase the latency of in-block seeks, because a seek always scans key/value pairs
sequentially from the first pair in the block.
Maximum Key Length
The key of each key/value pair is currently limited to 64KB. In practice, 10-100 bytes
is a typical size for most of our applications. Even in the HBase data model, the key
(rowkey + column family:qualifier + timestamp) should not be too long.
Maximum File Size
The trailer, the file info, and all data block indexes (and optionally the meta block
indexes) are kept in memory while an HFile is being written or read. So a larger HFile
(with more data blocks) requires more memory. For example, a 1GB uncompressed HFile has
about 15,600 (1GB/64KB) data blocks and correspondingly about 15,600 index entries.
Assuming an average key size of 64 bytes, we need about 1.2MB of RAM (15,600 x 80) to
hold these indexes in memory.
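The same arithmetic can be expressed as a small back-of-the-envelope helper. This is
only a sketch: the 16-byte per-entry overhead (an 8-byte block offset, a 4-byte block
size, and roughly 4 bytes for the key length) is an assumption for illustration, not a
figure taken from the HFile source.

public class HFileIndexMemoryEstimate {
    // Assumed per-index-entry overhead besides the key bytes:
    // an 8-byte block offset, a 4-byte block size, and ~4 bytes of key-length encoding.
    private static final long PER_ENTRY_OVERHEAD = 16;

    static long estimateIndexBytes(long fileSize, long blockSize, long avgKeySize) {
        long blocks = (fileSize + blockSize - 1) / blockSize; // one index entry per data block
        return blocks * (avgKeySize + PER_ENTRY_OVERHEAD);
    }

    public static void main(String[] args) {
        // 1GB file, 64KB blocks (decimal units, matching the 15,600 figure above), 64-byte keys.
        long bytes = estimateIndexBytes(1000000000L, 64000L, 64L);
        System.out.printf("~%.2f MB of index RAM%n", bytes / 1000000.0);
    }
}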
Compression Algorithm
- Compression reduces the number of bytes written to/read from HDFS.
- Compression improves effective network bandwidth utilization and saves disk space.
- Compression reduces the amount of data that must be read when serving a read.
To keep friction low, a real-time compression library is preferred. Currently, HFile
supports the following three algorithms:
(1) NONE (default, uncompressed, string name "none")
(2) GZ (Gzip, string name "gz")
    Out of the box, HFile ships with only Gzip compression, which is fairly slow.
(3) LZO (Lempel-Ziv-Oberhumer, preferred, string name "lzo")
    To achieve maximal performance and benefit, you must enable LZO, a lossless data
    compression algorithm that is focused on decompression speed.
The following outline shows the format of an HFile.

Overall layout (from beginning to end of the file):
  Data Block 0, Data Block 1, Data Block 2, ...
  Meta Block 0, Meta Block 1, ... (optional)
  File Info
  Data Index
  Meta Index (optional)
  Fixed File Trailer

Data block layout:
  DATA BLOCK MAGIC (8B)
  Key-Value (first)
  ...
  Key-Value (last)
  Each key-value record: KeyLen (int), ValLen (int), Key (byte[]), Value (byte[])

Meta blocks (optional) hold user defined metadata and start with METABLOCKMAGIC.

File Info layout:
  Size or ItemsNum (int)
  LASTKEY (byte[])
  AVG_KEY_LEN (int)
  AVG_VALUE_LEN (int)
  COMPARATOR (class name)
  User defined entries

Data Index layout:
  INDEX BLOCK MAGIC (8B)
  One entry per data block: Offset (long), DataSize (int), KeyLen (vint), Key (byte[])

Meta Index layout (optional):
  INDEX BLOCK MAGIC (8B)
  One entry per meta block: Offset (long), MetaSize (int), MetaNameLen (vint), MetaName (byte[])

Fixed File Trailer layout:
  TRAILER BLOCK MAGIC (8B)
  File Info Offset (long)
  Data Index Offset (long)
  Data Index Count (int)
  Meta Index Offset (long)
  Meta Index Count (int)
  Total Uncompressed Data Bytes (long)
  Entry Count, i.e. data key-value count (int)
  Compression Codec (int)
  Version (int)
  Total size of the trailer: 4 longs + 5 ints + 8 bytes of magic = 60 bytes
As shown above, an HFile is divided into multiple segments. From beginning to end,
they are:
- Data Block segment
  Stores the key/value pairs; may be compressed.
- Meta Block segment (optional)
  Stores user defined large metadata; may be compressed.
- File Info segment
  Small metadata about the HFile, stored without compression. Users can add small
  user defined metadata (name/value pairs) here.
- Data Block Index segment
  Indexes the data block offsets in the HFile. The key of each index entry is the key
  of the first key/value pair in the block.
- Meta Block Index segment (optional)
  Indexes the meta block offsets in the HFile. The key of each index entry is the user
  defined unique name of the meta block.
- Trailer
  Fixed-size metadata holding the offset of each segment, etc. To read an HFile, we
  should always read the trailer first.
The current implementation of HFile does not include a Bloom filter; one should be
added in the future.
The FileInfo is implemented as a SortedMap, so its fields are actually ordered
alphabetically by key.
3. LZO Compression
LZO has been removed from Hadoop and HBase 0.20+ because of GPL licensing restrictions.
To enable it, we must first install the native libraries as follows. [6][7][8][9]
(1) Download LZO: http://www.oberhumer.com/, and build.
# ./configure --build=x86_64-redhat-linux-gnu --enable-shared --disable-asm
# make
# make install
The libraries are then installed in /usr/local/lib.
(2) Download the native connector library
http://code.google.com/p/hadoop-gpl-compression/, and build.
Copy hadoop-0.20.0-core.jar to ./lib.
# ant compile-native
# ant jar
(3) Copy the native library (build/native/Linux-amd64-64) and
hadoop-gpl-compression-0.1.0-dev.jar to your application's lib directory. If your
application is a MapReduce job, copy them to Hadoop's lib directory. Your application
should follow the $HADOOP_HOME/bin/hadoop script to ensure that the native Hadoop
library is on the library path via the system property -Djava.library.path=<path>. [9]
For example:
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
Then our application and hadoop/MapReduce can use LZO.
4. Performance Evaluation
Testbed
− 4 slaves + 1 master
− Machine: 4 CPU cores (2.0GHz), 2 x 500GB 7200RPM SATA disks, 8GB RAM.
− Linux: RedHat 5.1 (2.6.18-53.el5), ext3, no RAID, noatime
− 1Gbps network, all nodes under the same switch.
− Hadoop-0.20.0 (1GB heap), lzo-2.0.3
MapReduce-based benchmarks were designed to evaluate the performance of operations on
HFiles in parallel.
− Total key/value entries: 30,000,000.
− Key/value size: 1000 bytes (10 for the key and 990 for the value), i.e. 30GB of data
  in total.
− Sequential key ranges: 60, i.e. each range has 500,000 entries.
− Default block size.
− The entry value is a string in which every run of 8 consecutive bytes is filled with
  the same letter (A~Z), e.g. "BBBBBBBBXXXXXXXXGGGGGGGG……".
We set mapred.tasktracker.map.tasks.maximum=3 to avoid a client-side bottleneck.
(1) Write
Each MapTask handles one key range and writes a separate HFile with 500,000 key/value
entries.
(2) Full Scan
Each MapTask scans a separate HFile from beginning to end.
(3) Random Seek to a specified key
Each MapTask opens a separate HFile and seeks to randomly selected keys within that
file. Each MapTask runs 50,000 (1/10 of the entries) random seeks.
(4) Random Short Scan
Each MapTask opens a separate HFile and, starting from a randomly selected key, scans
30 entries. Each MapTask runs 50,000 such scans, i.e. it scans 50,000 * 30 = 1,500,000
entries.
The table below shows the average number of entries written/sought/scanned per second,
per node.
Operation            none     gz       lzo      SequenceFile (uncompressed)
Write                20718    23885    55147    19789
Full Scan            41436    94937    100000   28626
Random Seek            600      989      956    N/A
Random Short Scan    12241    25568    25655    N/A
In this evaluation, the compression ratio is about 7:1 for gz (Gzip) and about 4:1 for
lzo. Even though the compression ratio is only moderate, the lzo column shows the best
performance, especially for writes.
Full-scan performance is much better than SequenceFile, so HFile may provide better
performance for MapReduce-based analytical applications.
Random seeks in HFiles are slow, especially in uncompressed HFiles, but the numbers
above are already 6x~10x better than a raw disk seek (10ms). Ganglia charts of these
runs show the load, CPU, and network overhead; random short scans show similar
behavior.
5. Implementation and API
5.1 HFile.Writer: How to create and write an HFile
(1) Constructors
There are 5 constructors. We suggest using the following two:

public Writer(FileSystem fs, Path path, int blocksize,
    String compress, final RawComparator<byte[]> comparator)

public Writer(FileSystem fs, Path path, int blocksize,
    Compression.Algorithm compress, final RawComparator<byte[]> comparator)
These two constructors behave the same. They create the file (by calling fs.create(...))
and obtain an FSDataOutputStream for writing. Since the FSDataOutputStream is created
when the HFile.Writer is constructed, it is automatically closed when the HFile.Writer
is closed.
Two other constructors take an FSDataOutputStream as a parameter, which means the file
is created and opened outside of the HFile.Writer; when we close the HFile.Writer, the
FSDataOutputStream is not closed. We do not suggest using these two constructors
directly.
public Writer(final FSDataOutputStream ostream, final int blocksize,
    final String compress, final RawComparator<byte[]> c)

public Writer(final FSDataOutputStream ostream, final int blocksize,
    final Compression.Algorithm compress, final RawComparator<byte[]> c)
The remaining constructor takes only fs and path as parameters; all other attributes
use their defaults, i.e. no compression (NONE), a 64KB block size, the raw
ByteArrayComparator, etc.
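For illustration, here is a minimal sketch using the second recommended constructor,
with the codec given as a Compression.Algorithm value. The path is a placeholder, and
Bytes.BYTES_RAWCOMPARATOR is assumed here as a convenient raw byte[] comparator from
the HBase utility classes; any RawComparator consistent with your key order will do.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriterConstruction {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Placeholder path; 64KB uncompressed data block size; GZ codec (NONE and LZO also exist).
        HFile.Writer writer = new HFile.Writer(fs, new Path("/tmp/example.hfile"),
                64 * 1024, Compression.Algorithm.GZ, Bytes.BYTES_RAWCOMPARATOR);
        // Since the writer created the file itself, closing the writer also closes
        // the underlying FSDataOutputStream.
        writer.close();
    }
}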
(2) Write Key/Value pairs into HFile
Before key/value pairs are written into an HFile, the application must sort them using
the same comparator; that is, all key/value pairs must be appended to the HFile
sequentially, in increasing key order. The following methods write/append key/value
pairs:

public void append(final KeyValue kv)
public void append(final byte[] key, final byte[] value)
public void append(final byte[] key, final int koffset, final int klength,
    final byte[] value, final int voffset, final int vlength)
When a key/value pair is added, the writer checks the current block size. If the block
has reached its maximum size, the current block is compressed and written to the
HFile's output stream, and a new block is started. Compression is performed per block:
a compression output stream is created at the beginning of each block and released when
the block is finished.
The following chart shows the layering of the output streams in the OO design:

DataOutputStream
  -> BufferedOutputStream
    -> FinishOnFlushCompressionStream
      -> Compression?OutputStream (varies with the codec)
        -> FSDataOutputStream (to the HFile)
A key/value append is written to the outermost stream (the DataOutputStream), and the
layered streams above handle buffering and compression before writing to the file in
the underlying file system.
Before a key/value pair is written, the following are checked:
- The length of the key
- The order of the key (it must be greater than the previously appended key)
(3) Add metadata to an HFile
We can add metadata blocks to an HFile:

public void appendMetaBlock(String metaBlockName, byte[] bytes)

The application should provide a unique metaBlockName for each metadata block within
an HFile.
Note: if your metadata is fairly large (e.g. 32KB uncompressed), use this method to add
it as a separate meta block, which may be compressed in the file. If your metadata is
very small (e.g. less than 1KB), use the following method to append it to the file info
instead. The file info is not compressed.

public void appendFileInfo(final byte[] k, final byte[] v)
(4) Close
The file is not completely written until the HFile.Writer is closed. So we must call
close() to:
- finish and flush the last block
- write all meta blocks into the file (possibly compressed)
- generate and write the file info metadata
- write the data block indexes
- write the meta block indexes
- generate and write the trailer metadata
- close the output stream
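Putting the write path together, a minimal end-to-end sketch is shown below. The path,
keys, values, and metadata names are invented for illustration; the calls themselves
are the constructor, append, appendMetaBlock, appendFileInfo, and close methods
described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Example path and parameters: 64KB blocks, gzip compression.
        HFile.Writer writer = new HFile.Writer(fs, new Path("/tmp/example.hfile"),
                64 * 1024, "gz", Bytes.BYTES_RAWCOMPARATOR);
        try {
            // Keys must be appended in strictly increasing order under the comparator.
            for (int i = 0; i < 100000; i++) {
                byte[] key = Bytes.toBytes(String.format("row-%09d", i));
                byte[] value = Bytes.toBytes("value-" + i);
                writer.append(key, value);
            }
            // Larger user metadata goes into a separate (possibly compressed) meta block.
            writer.appendMetaBlock("example.summary", Bytes.toBytes("placeholder metadata"));
            // Small metadata goes into the uncompressed file info.
            writer.appendFileInfo(Bytes.toBytes("example.generator"),
                    Bytes.toBytes("HFileWriteExample"));
        } finally {
            // close() flushes the last block and writes the meta blocks, file info,
            // block indexes, and trailer.
            writer.close();
        }
    }
}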
5.2 HFile.Reader: How to read an HFile
Create an HFile.Reader to open an HFile; then we can seek, scan, and read it.
(1) Constructor
We suggest using the following constructor to create an HFile.Reader:

public Reader(FileSystem fs, Path path, BlockCache cache, boolean inMemory)

It calls fs.open(...) to open the file and obtains an FSDataInputStream for reading.
The input stream is automatically closed when the HFile.Reader is closed.
Another constructor takes an FSDataInputStream as a parameter, which means the file is
opened outside the HFile.Reader:

public Reader(final FSDataInputStream fsdis, final long size,
    final BlockCache cache, final boolean inMemory)

We can use a BlockCache to improve read performance; its mechanism will be described in
another document.
(2) Load metadata and block indexes of an HFile
The HFile is not readable until loadFileInfo() is explicitly called. It reads the
metadata (trailer, file info) and the block indexes (data block and meta block) into
memory, and the COMPARATOR instance is reconstructed from the file info.
BlockIndex
The important method of BlockIndex is:

int blockContainingKey(final byte[] key, int offset, int length)

It uses binarySearch to determine which block contains a given key. The return value
convention of binarySearch() can be puzzling:

Position in the data block index list:   before   0    1    2    3    4    5    6   ...
binarySearch() return value:               -1    -2   -3   -4   -5   -6   -7   -8   ...
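To make this mapping concrete, here is a small self-contained sketch of how such a
binary-search result over the blocks' first keys can be translated into a
containing-block index. It illustrates the java.util.Arrays.binarySearch convention
with plain strings; it is not the actual HBase implementation.

import java.util.Arrays;

public class BlockIndexSearchExample {
    /**
     * Returns the index of the block that would contain 'key', or -1 if the key sorts
     * before the first key of the first block. 'firstKeys' holds the first key of
     * every data block, in sorted order.
     */
    static int blockContainingKey(String[] firstKeys, String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;                  // key is exactly the first key of block 'pos'
        }
        int insertionPoint = -(pos + 1); // binarySearch returns -(insertionPoint) - 1
        return insertionPoint - 1;       // key falls inside the preceding block
    }

    public static void main(String[] args) {
        String[] firstKeys = {"b", "f", "k", "p"};               // first keys of blocks 0..3
        System.out.println(blockContainingKey(firstKeys, "a"));  // -1: before block 0
        System.out.println(blockContainingKey(firstKeys, "g"));  //  1: inside block 1
        System.out.println(blockContainingKey(firstKeys, "k"));  //  2: first key of block 2
    }
}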
HFileScanner
We must create an HFile.Reader.Scanner to seek, scan, and read an HFile.
HFile.Reader.Scanner is an implementation of the HFileScanner interface.
To seek and scan in an HFile, we proceed as follows:
(1) Create an HFile.Reader and call loadFileInfo().
(2) On this HFile.Reader, call getScanner() to obtain an HFileScanner.
(3) a. To scan from the beginning of the HFile, call seekTo() to seek to the beginning
       of the first block.
    b. To scan from a given key, call seekTo(key) to seek to the position of that key,
       or just before it if the key is not present in the HFile.
    c. To scan from just before a key, call seekBefore(key).
(4) Call next() to iterate over the key/value pairs. next() returns false when it
    reaches the end of the HFile. If an application wants to stop on some other
    condition (e.g. at a special endKey), it must implement that itself.
(5) To look up a specific key, just call seekTo(key); a return value of 0 means the key
    was found.
(6) After seekTo(...) or next() positions the scanner on a key, the following methods
    return the current key and value:
    public KeyValue getKeyValue() // recommended
    public ByteBuffer getKey()
    public ByteBuffer getValue()
(7) Don't forget to close the HFile.Reader. A scanner does not need to be closed, since
    it does not hold any resources.
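As a hedged end-to-end sketch of the read path (the path and keys are placeholders,
null is passed for the BlockCache to disable caching, and minor details may differ from
the 0.20.0 API), the steps above translate roughly to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // (1) Open the reader and load the metadata and block indexes.
        HFile.Reader reader = new HFile.Reader(fs, new Path("/tmp/example.hfile"), null, false);
        reader.loadFileInfo();
        try {
            // (2) Obtain a scanner.
            HFileScanner scanner = reader.getScanner();
            // (5) Seek to a specific key; a return value of 0 means an exact match.
            byte[] key = Bytes.toBytes("row-000000042");
            if (scanner.seekTo(key) == 0) {
                System.out.println("Found: " + Bytes.toString(scanner.getKeyValue().getValue()));
            }
            // (3a) + (4) Scan everything from the beginning of the file.
            if (scanner.seekTo()) {
                do {
                    // (6) getKeyValue() is the recommended accessor for the current entry.
                    System.out.println(Bytes.toString(scanner.getKeyValue().getRow()));
                } while (scanner.next());
            }
        } finally {
            // (7) Close the reader; scanners do not need to be closed.
            reader.close();
        }
    }
}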
References
[1] Google, Bigtable: A Distributed Storage System for Structured Data.
    http://labs.google.com/papers/bigtable.html
[2] HBase-0.20.0 Documentation.
    http://hadoop.apache.org/hbase/docs/r0.20.0/
[3] HFile code review and refinement.
    http://issues.apache.org/jira/browse/HBASE-1818
[4] MapFile API.
    http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[5] Parallel LZO: Splittable Compression for Hadoop.
    http://www.cloudera.com/blog/2009/06/24/parallel-lzo-splittable-compression-for-hadoop/
    http://blog.chrisgoffinet.com/2009/06/parallel-lzo-splittable-on-hadoop-using-cloudera/
[6] Using LZO in Hadoop and HBase.
    http://wiki.apache.org/hadoop/UsingLzoCompression
[7] LZO. http://www.oberhumer.com
[8] Hadoop LZO native connector library.
    http://code.google.com/p/hadoop-gpl-compression/
[9] Hadoop Native Libraries Guide.
    http://hadoop.apache.org/common/docs/r0.20.0/native_libraries.html