Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. The talk introduces the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. The talk is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.
Paul Tarjan ( http://github.com/ptarjan ) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
Introduction to data processing using Hadoop and Pig (Ricardo Varela)
In this talk we introduce data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation at the first Delhi Hadoop User Group (DHUG) meetup, held at Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering Hadoop MapReduce architecture soon. Most of the material covered in these slides can also be found in Tom White's book (see the last slide).
Pig at LinkedIn by Chris Riccomini from LinkedIn
Pig is an integral part of data analytics at LinkedIn. Learn about LinkedIn’s analytic stack, and see how Pig is used to design, develop, and deliver data products at LinkedIn. We’ll explore a successful example of Pig deployment at LinkedIn, pain points, and integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem.
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector (Yahoo Developer Network)
Big data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute and monitor robust data flows. In this session we'll look at how SDC's "intent-driven" approach keeps the data flowing, whether you're processing data 'off-cluster', in Spark, or in MapReduce.
StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.
Speakers:
Pat Patterson has been working with Internet technologies since 1997, building software and working with communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. Part of the developer evangelism team at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign... (Yahoo Developer Network)
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain will explain in detail how IgniteRDD — an implementation of native Spark RDD and DataFrame APIs — shares the state of the RDD across other Spark jobs, applications and workers. Dmitriy will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames. Don't miss this opportunity to learn from one of the experts how to use Spark and Ignite better together in your projects.
Speakers:
Dmitriy Setrakyan is a founder and CPO at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay, where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy also acts as PMC chair of the Apache Ignite project.
The first part of the talk will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages, incremental and partial processing, combinatorial, conditional and optional processing, priority processing, late processing and BCP management. The second part of the talk will focus on out-of-the-box support for Spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
The next generation of Hadoop MapReduce
Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.
More information and video available at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Bay Area HUG Feb 2011 introduction and Yahoo refocusing on Apache Hadoop releases.
More information at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012 (Nick Galbreath)
Rate Limits at Scale SANS AppSec Las Vegas.
Rate limit everything, all the time, using a quantized time system with Memcache or Redis. Use this to protect resources or discover anomalies.
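As a rough illustration of what a quantized time system can look like, here is a minimal Python sketch. It is not taken from the talk: the class name `QuantizedRateLimiter`, the key format, and the in-memory dictionary standing in for Memcache or Redis are all assumptions made for the example. Time is divided into fixed windows, each request increments a counter keyed by client and window number, and the request is allowed while the counter stays within the limit.

```python
import time
from collections import defaultdict


class QuantizedRateLimiter:
    """Fixed-window rate limiter keyed on quantized time buckets (hypothetical sketch)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # Assumption: a plain dict stands in for Memcache/Redis here.
        # A shared store would use an atomic increment and expire old keys.
        self.counters = defaultdict(int)

    def bucket_key(self, client_id, now):
        # Quantize the timestamp: every instant in the same window maps to one key.
        bucket = int(now // self.window_seconds)
        return f"{client_id}:{bucket}"

    def allow(self, client_id, now=None):
        if now is None:
            now = time.time()
        key = self.bucket_key(client_id, now)
        self.counters[key] += 1
        return self.counters[key] <= self.limit


limiter = QuantizedRateLimiter(limit=3, window_seconds=60)
# Five requests inside one 60-second window, then one in the next window.
results = [limiter.allow("10.0.0.1", now=120.0 + i) for i in range(5)]
next_window = limiter.allow("10.0.0.1", now=180.0)
print(results, next_window)
```

Because the window number is part of the key, no reset step is needed: a new window simply starts a new counter. The per-key counts could also be inspected to spot unusually busy clients, which is one way to read the "discover anomalies" use mentioned above.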
Overview of Big data, Hadoop and Microsoft BI - version1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of big data & hadoop version 1 - Tony Nguyen (Thanh Nguyen)
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices and technologists today. However, many people still aren't sure how Big Data and an existing data warehouse can be combined to turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune (amrutupre)
MindScripts Technologies is one of the leading Big-Data Hadoop training institutes in Pune, providing a complete Big-Data Hadoop course with Cloudera certification.
Similar to Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter (20)
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Enhancing Performance with Globus and the Science DMZ (Globus)
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... (Jen Stirrup)
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
Securing your Kubernetes cluster_ a step-by-step guide to success ! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
Climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI-powered automation technology capabilities of UiPath. Hosted by our local partner Marc Ellis, the event offers a half-day packed with industry insights and networking with automation peers.
📕 Curious about our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!