This document discusses experiences using Hadoop and HBase in the Perf-Log project. It provides an overview of the Perf-Log data format and architecture, describes how Hadoop and HBase were configured, and gives examples of using MapReduce jobs and HBase APIs like Put and Scan to analyze log data. Key aspects covered include matching Hadoop and HBase versions, running MapReduce jobs, using column families in HBase, and filtering Scan results.
11. Yahoo! Cloud Serving Benchmark
• 3 HBase nodes on Solaris zones

           Throughput      Average Response Time   Max Response Time
  Write    1808 writes/s   1.6 ms                  0.02% > 1s (due to region splitting)
  Read     9846 reads/s    0.3 ms                  45 ms
13. Setting up Hadoop
• Supported Platforms
• Linux – best
• Solaris – OK, just works
• Windows – not recommended
• Required Software
• JDK 1.6.x
• SSH
• Packages
• Cloudera
14. Match Hadoop & HBase Version

  Hadoop version            HBase version   Compatible?
  0.20.3 release            0.90.x          NO
  0.20-append               0.90.x          YES
  0.20.5 release            0.90.x          YES
  0.21.0 release            0.90.x          NO
  0.22.x (in development)   0.90.x          NO
15. Running Modes of Hadoop
• Standalone Operation
By default, Hadoop runs in a non-distributed mode, as a single Java process. This is useful for debugging.
• Pseudo-Distributed Operation
Runs on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process (a sample configuration follows below).
• Fully-Distributed Operation
Runs in a cluster; the real production environment.
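For orientation, here is what a minimal pseudo-distributed configuration looked like for 0.20-era Hadoop. This is a sketch, not the perf-log project's actual setup; the localhost host/port values are illustrative:

<!-- conf/core-site.xml: where the default file system (the HDFS NameNode) lives -->
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can hold only one replica of each block -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>

<!-- conf/mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
</configuration>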
18. MapReduce Job
MapReduce is a programming model for data processing on Hadoop. It works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer.
• Mapper
A Mapper usually processes data line by line: it ignores useless lines and collects useful information from the data into <Key, Value> pairs.
• Reducer
Receives the <Key, <Value1, Value2, …>> pairs from the Mappers, aggregates the statistics, and writes the results as <Key, Value> pairs.
20. Serialization in Hadoop

  Java type   Writable class
  int         IntWritable
  long        LongWritable
  boolean     BooleanWritable
  byte        ByteWritable
  float       FloatWritable
  double      DoubleWritable
  String      Text
  null        NullWritable
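Beyond these built-in types, the Writable interface is small enough to implement directly for your own record types. A minimal sketch follows; the EventRecord class and its fields are illustrative, not types from the perf-log code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom record type that Hadoop can serialize between map and reduce tasks.
public class EventRecord implements Writable {
  private long timestamp;
  private String eventName;

  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);   // fields are written in a fixed order...
    out.writeUTF(eventName);
  }

  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();  // ...and must be read back in the same order
    eventName = in.readUTF();
  }
}

To use such a type as a key (rather than a value), it must additionally implement WritableComparable so the framework can sort it during the shuffle.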
21. Example: WordCount
Before we jump into the details, let's walk through an example MapReduce application to get a flavour of how they work. WordCount is a simple application that counts the number of occurrences of each word in a given input set.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper must be extended and implemented.
// Input key-value format: <LongWritable, Text>; output format: <Text, IntWritable>.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      // Put the word as Key and its occurrence (1) as Value into the collector.
      output.collect(word, one);
    }
  }
}

// The Reducer's input key-value format matches the output format of the Mapper.
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
22. MapReduce Job Configuration
Before running a MapReduce job, the following fields should be set:
• Mapper Class
The mapper class you wrote, to be run.
• Reducer Class
The reducer class you wrote, to be run.
• InputFormat & OutputFormat
Define the format of all inputs and outputs. A large number of formats are supported in the Hadoop library.
• OutputKeyClass & OutputValueClass
The data type classes of the outputs that Mappers send to Reducers.
23. Example: WordCount
Code to run the job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Set the output key & value classes.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Set the Mapper & Reducer classes.
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // Set the InputFormat & OutputFormat classes.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Set the input & output paths.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
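Once compiled and packaged, the job is launched with the hadoop command; the jar name and HDFS paths below are illustrative:

hadoop jar wordcount.jar WordCount /user/input /user/output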
25. Example in perf-log
Here is an example of using MapReduce to analyze the log files in the perf-log project. In the log files there are two kinds of record, each on a single line:
• Event level
• Request level
26. Example Using MapReduce
Here we use a MapReduce job to calculate the most used event per day. All the event records are collected in Map and the most used events are counted in Reduce (a code sketch follows the diagram below).

Input log (one record per line):
  event PLT_LOGIN
  request record…
  request record…
  event PM_HOME
  request record…
  event PM_OPENFORM
  request record…
  request record…
  request record…
  event CDP_LOGOUT
  request record…
  request record…
  request record…
  …

Map output:
  (11/12, PLT_LOGIN)
  (11/12, PM_HOME)
  (11/12, PLT_LOGIN)
  (11/12, PM_LOGOUT)
  …
  (11/13, CDP_LOGIN)
  (11/13, CDP_LOGIN)
  …

Shuffle (automatic):
  (11/12, [PLT_LOGIN, PM_HOME, PLT_LOGIN, PLT_LOGOUT, …])
  (11/13, [CDP_LOGIN, CDP_LOGIN, …])
  …

Reduce output:
  (11/12, PLT_LOGIN)
  (11/13, PM_HOME)
  (11/14, CDP_HOME)
  (11/15, PLT_HOME)
  …
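A sketch of that job in the same old-style API as the WordCount example is shown below. The line parsing is an assumption: it supposes each event record starts with a date field followed by the literal word "event" and the event name, which may differ from the real perf-log layout.

public static class EventMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private final Text date = new Text();
  private final Text event = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Assumed layout: "<date> event <EVENT_NAME> ..."; request records are skipped.
    String[] fields = value.toString().split("\\s+");
    if (fields.length >= 3 && "event".equals(fields[1])) {
      date.set(fields[0]);
      event.set(fields[2]);
      output.collect(date, event);  // emit (day, event name)
    }
  }
}

public static class TopEventReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Count each event name for this day, then emit the most frequent one.
    java.util.Map<String, Integer> counts = new java.util.HashMap<String, Integer>();
    while (values.hasNext()) {
      String name = values.next().toString();
      Integer c = counts.get(name);
      counts.put(name, c == null ? 1 : c + 1);
    }
    String top = null;
    int max = 0;
    for (java.util.Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > max) { max = e.getValue(); top = e.getKey(); }
    }
    output.collect(key, new Text(top));
  }
}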
28. Table Structure
Tables in HBase have the following features:
1. They are large, sparsely populated tables.
2. Each row has a row key.
3. Table rows are sorted by row key, the table's primary key. By default, the sort is byte-ordered.
4. Row columns are grouped into column families. A table's column families must be specified up front as part of the table schema definition and cannot be changed afterwards.
5. New column family members (qualifiers) can be added on demand.
29. Table Structure
Here is the table structure of "perflog" in the perf-log project:

             column family "event"        column family "req"
  row key    event_name  event_id  …      req1  req1_id  …  req2  req2_id  …
  row1       xxx         xxx       …      xxx   xxx      …  xxx   xxx      …
  row2       xxx         xxx       …      xxx   xxx      …  xxx   xxx      …

  (The header rows show the two column families and their column qualifiers; each xxx cell is a column value.)
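A table with this layout could be created with the 0.90-era client API. A sketch, with configuration details omitted:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("perflog");
desc.addFamily(new HColumnDescriptor("event")); // family for event-level columns
desc.addFamily(new HColumnDescriptor("req"));   // family for request-level columns
admin.createTable(desc);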
30. Column Design
When designing column families and qualifiers, pay attention to the following two points:
1. Keep the number of column families in your schema low. HBase currently does not do well with anything above two or three column families.
2. Keep column family and qualifier names as short as possible. Operating on a table in HBase causes many thousands of comparisons on column names, so short names improve performance.
31. HBase Command Shell
HBase provides a command shell to operate the system. Here are some example commands (an example session follows below):
• status
• create
• list
• put
• scan
• disable & drop
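For instance, a shell session against a table like perflog might look as follows; the row, column, and value are illustrative:

hbase shell
> status
> create 'perflog', 'event', 'req'
> list
> put 'perflog', 'row1', 'event:event_name', 'PLT_LOGIN'
> scan 'perflog'
> disable 'perflog'
> drop 'perflog'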
33. API to Operate Tables in HBase
There are four main methods to operate on a table in HBase (a short code sketch follows below):
• Get
• Put
• Scan
• Delete
**Put and Scan are widely used in the perf-log project.
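Get and Delete, the two operations not detailed on the following slides, look roughly like this in the 0.90-era Java client; the table, row, and column names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(HBaseConfiguration.create(), "perflog");

// Get: fetch attributes of one row (here restricted to a single cell).
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("event"), Bytes.toBytes("event_name"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("event"), Bytes.toBytes("event_name"));

// Delete: remove the whole row.
table.delete(new Delete(Bytes.toBytes("row1")));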
34. Using Put & Scan in HBase
When using Put in HBase, notice:
• AutoFlush
• WAL on Puts
When using Scan in HBase, notice:
• Scan Attribute Selection
• Scan Caching
(A code sketch covering these points follows below.)
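In code, these four points come out roughly as follows. This is a sketch against the 0.90-era API; table and column names are illustrative:

HTable table = new HTable(HBaseConfiguration.create(), "perflog");
table.setAutoFlush(false);                // buffer Puts client-side instead of one RPC per Put

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("event"), Bytes.toBytes("event_name"), Bytes.toBytes("PLT_LOGIN"));
// put.setWriteToWAL(false);              // faster, but loses data on RegionServer failure
table.put(put);
table.flushCommits();                     // explicitly send the buffered Puts

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("event"));   // attribute selection: fetch only the family you need
scan.setCaching(500);                     // ship 500 rows per RPC instead of the default 1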
35. Using Scan with Filter
HBase filters are a powerful feature that can greatly enhance your effectiveness when working with data stored in tables. Four filters are used in the perf-log project (construction examples follow below):
• SingleColumnValueFilter
Use this filter when exactly one column decides whether an entire row should be returned or not.
• RowFilter
This filter gives you the ability to filter data based on row keys.
• PageFilter
Paginate through rows by employing this filter.
• FilterList
Enables you to use several filters at the same time.
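As a reference, the first two filters are constructed like this in the 0.90-era API; the family, qualifier, and values are illustrative:

// Return only rows whose event:event_name column equals PLT_LOGIN.
Scan scan = new Scan();
scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("event"), Bytes.toBytes("event_name"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN")));

// Return only rows whose row key starts with a given prefix.
Scan rowScan = new Scan();
rowScan.setFilter(new RowFilter(
    CompareFilter.CompareOp.EQUAL,
    new BinaryPrefixComparator(Bytes.toBytes("2011-11-12"))));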
36. Using Scan with Filter
• PageFilter
There is a fundamental issue with filtering on physically separate servers: filters run on different region servers in parallel and cannot retain or communicate their current state across those boundaries, and each filter is required to scan at least up to pageCount rows before ending the scan. Thus you may get back more rows than you really want.

Filter filter = new PageFilter(5); // 5 is the pageCount
int totalRows = 0;
byte[] lastRow = null;
while (true) {
  Scan scan = new Scan();
  scan.setFilter(filter);
  if (lastRow != null) {
    // Resume just past the last row seen; a trailing zero byte gives the
    // smallest row key that sorts after lastRow.
    byte[] startRow = Bytes.add(lastRow, new byte[] { 0 });
    scan.setStartRow(startRow);
  }
  ResultScanner scanner = table.getScanner(scan);
  int localRows = 0;
  Result result;
  while ((result = scanner.next()) != null) {
    totalRows++;
    localRows++;               // rows seen in this page
    lastRow = result.getRow();
  }
  scanner.close();
  if (localRows == 0) break;   // no more rows: stop paging
}
37. Using Scan with Filter
• FilterList
When using multiple filters with FilterList, note that adding the filters to the FilterList in different orders will produce different results.

pageFilter = new PageFilter(5);
singleColumnValueFilter = new SingleColumnValueFilter(
    Bytes.toBytes("event"), Bytes.toBytes("name"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN"));

// Take out the first 5 records, then return the ones whose event name is "PLT_LOGIN":
filterList = new FilterList();
filterList.addFilter(pageFilter);
filterList.addFilter(singleColumnValueFilter);

// Take out all the records whose event name is "PLT_LOGIN", then return the first 5 of them:
filterList = new FilterList();
filterList.addFilter(singleColumnValueFilter);
filterList.addFilter(pageFilter);
38. MapReduce with HBase
Here is an example:

static class MyMapper<K, V> extends MapReduceBase implements Mapper<LongWritable, Text, K, V> {
  private HTable table;

  @Override
  public void configure(JobConf jc) {
    super.configure(jc);
    try {
      // Instantiate the HTable once per task, not once per record.
      this.table = new HTable(HBaseConfiguration.create(), "table_name");
    } catch (IOException e) {
      throw new RuntimeException("Failed HTable construction", e);
    }
  }

  @Override
  public void close() throws IOException {
    super.close();
    table.close();
  }

  public void map(LongWritable key, Text value, OutputCollector<K, V> output, Reporter reporter) throws IOException {
    // A Put must be constructed with a row key derived from your record.
    Put p = new Put(rowKey);
    … // Set your own put.
    table.put(p);
  }
}
39. Bulk Load
HBase includes several methods of loading data into tables. The most straightforward method is to use either a MapReduce job or the normal client APIs; however, these are not always the most efficient methods.
The bulk load feature instead uses a MapReduce job to output table data in HBase's internal data format (HFiles), and then directly loads the data files into a running cluster. Bulk loading uses less CPU and network resources than simply using the HBase API.
Data files → MapReduce job → HFiles → HBase
40. Bulk Load
Notice that we use HFileOutputFormat as the output format of the MapReduce job used to generate the HFiles. However, the HFileOutputFormat provided by the HBase library does NOT support writing multiple column families into HFiles.
A multi-family-capable version of HFileOutputFormat can be found here:
https://review.cloudera.org/r/1272/diff/1/?file=17977#file17977line93
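Putting the two steps together, a bulk load against the 0.90-era API might be sketched as follows. The output path and table name are illustrative, and the mapper (not shown) must emit Puts or KeyValues keyed by ImmutableBytesWritable:

// Step 1: a MapReduce job that writes HFiles instead of going through the Put API.
Job job = new Job(conf, "perflog-bulkload");
HTable table = new HTable(conf, "perflog");
HFileOutputFormat.configureIncrementalLoad(job, table); // sets output format, partitioner, reducer
FileOutputFormat.setOutputPath(job, new Path("/tmp/perflog-hfiles"));
job.waitForCompletion(true);

// Step 2: move the generated HFiles into the running cluster.
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/tmp/perflog-hfiles"), table);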
41. Thank You, and Questions
See more about Hadoop & HBase:
http://confluence.successfactors.com/display/ENG/Programming+experience+on+Hadoop+&+HBase
Editor's Notes
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html
Data blocks are automatically replicated across DataNodes. Fault-tolerant. The default number of replicas is 3.
Shared-nothing architecture: add DataNodes to increase disk capacity and I/O throughput.
Due to replicas and internal structure, the usable capacity will be less than 1/3 of the raw capacity.
The NameNode manages the file system's metadata.
It is a SPOF and needs HA and backup.
Its workload increases with the number of files/blocks and operations: a potential bottleneck.
The JobTracker manages Map/Reduce job execution. It often runs alongside the NameNode.
A job is split into tasks. TaskTrackers manage task execution and run on the DataNodes.
A naturally distributed parallel computing architecture.
Web console to monitor jobs/tasks.
The "hadoop" command runs jobs and manages nodes and the file system. In particular, "hadoop fs" provides many Unix-like commands to access HDFS.
The HMaster manages region servers. It normally runs together with the Hadoop NameNode.
Data are sorted by row key and split into regions, which are managed by region servers. Region servers often run on DataNodes.
Each region includes one MemStore and several store files.
Data writes are recorded in the Write-Ahead Log (HLog; by default it is flushed to disk every 1 second) and written into the MemStore.
When the MemStore becomes full, it is flushed to HDFS as a store file.
Full set of operations: get, put, scan, delete.
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
Required software for Linux and Windows includes:
Java 1.6.x, preferably from Sun, must be installed.
ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
Additional requirements for Windows include:
Cygwin - Required for shell support in addition to the required software above.
1. Hadoop Version
The most used Hadoop versions:
0.20.203.X
The current stable version. It does NOT contain the entire new MapReduce API and does NOT have the sync attribute on HDFS. Currently used in the perf-log project.
0.20.205.X
The current beta version. It does NOT contain the entire new MapReduce API, but it has the sync attribute on HDFS.
0.21.X
The newest version. It provides the entire new MapReduce API, but it is unstable, unsupported, does not include security, and cannot run with HBase.
2. Running HBase on Hadoop
The newest version of HBase is 0.90.x. This version of HBase will only run on Hadoop 0.20.x; it will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Hadoop 0.20.2 and Hadoop 0.20.203.0 do NOT have this attribute. Choose one of the following solutions:
HBase bundles an instance of the Hadoop jar under its lib directory. The bundled Hadoop was made from the Apache branch-0.20-append branch at the time of the HBase release and has the sync attribute. Replace the Hadoop jar you are running on your cluster with the Hadoop jar found in the HBase lib directory.
Use the Cloudera or MapR distributions. Cloudera's CDH3 is Apache Hadoop 0.20.x plus patches, including all of the 0.20-append additions needed for a durable sync. CDH3 contains both Hadoop and HBase in its distribution.
Just use Hadoop 0.20.205.0. Since this release includes a merge of the append/hsync/hflush capabilities from the 0.20-append branch, it can support HBase in secure mode. But it is a beta version.
In Hadoop MapReduce, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
Hadoop uses its own serialization format, Writables, which is certainly compact and fast (but not so easy to extend, or to use from languages other than Java).
The Hadoop library provides many basic data types to be used in MapReduce, and you can also implement your own data structures according to the Writable interfaces.
Status
Show the status of all nodes in HBase.
Create
Create a table.
List
List all the existing tables.
Put
Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).
Scan
Scan allows iteration over multiple rows for specified attributes of a certain table.
Disable & Drop
When deleting a table, first disable it, then drop it.
Get
Get returns attributes for a specified row.
Put
Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).
Scan
Scan allows iteration over multiple rows for specified attributes. It can be used with filters and provides powerful query functions on HBase.
Delete
Delete removes a row from a table.
When using put in HBase, notice:
AutoFlush
AutoFlush meets the request of real time, and you can immediately see the row after it is added into the table. But when performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. If autoFlush = false, these messages are not sent until the write-buffer is filled, so it can reduce the number of client RPC calls. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits.
WAL on Puts
WAL means Write-Ahead Log. Turning this off means that the RegionServer will not write the Put to the Write-Ahead Log, only into the MemStore, which improves performance. HOWEVER, turning it off is not recommended, because if there is a RegionServer failure there will be data loss.
When using scan in HBase, notice:
Scan Attribute Selection
Whenever a Scan is used to process large numbers of rows, be aware of which attributes are selected. Call scan.addFamily to select only the specific columns you want rather than fetching the entire row, because attribute over-selection is a non-trivial performance penalty over large datasets.
Scan Caching
When performing a large number of Scans, make sure that the input Scan instance has setCaching set to something greater than the default (which is 1). Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit tradeoff in making the cache value large, because it costs more memory for both the client and the RegionServer, so bigger isn't always better.
HBase can be both the input and output of a Map Reduce Job. In the perf-log project, we use HBase as the output of the MR job and it is best to obey the following rules:
Get one HTable instance
There is a cost instantiating an HTable, so if you do this for each insert, you may have a negative impact on performance. Hence our setup of HTable in the configure() step.
Skip the Reducer if possible
When writing a lot of data to an HBase table from a MR job and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.