We help you get web data hassle-free. This deck introduces the use cases that are most beneficial to finance companies and to those looking to scale revenue using web data.
Living with SQL and NoSQL at craigslist, a Pragmatic Approach - Jeremy Zawodny
From the 2012 Percona Live MySQL Conference in Santa Clara, CA.
Craigslist uses a variety of data storage systems in its backend: in-memory, SQL, and NoSQL. This talk is an overview of how craigslist works, with a focus on the data storage and management choices that were made in each of its major subsystems. These include MySQL, memcached, Redis, MongoDB, Sphinx, and the filesystem. Special attention will be paid to the benefits and tradeoffs associated with choosing from the various popular data storage systems, including long-term viability, support, and ease of integration.
Don't miss this opportunity to learn about the advantages of NoSQL. Join our webinar and discover:
What the term NoSQL means
How key-value, wide-column, graph, and document stores differ
What the term "multi-model" means
These are the slides I presented at the Nosql Night in Boston on Nov 4, 2014. The slides were adapted from a presentation given by Steve Francia in 2011. The original slide deck can be found here:
http://spf13.com/presentation/mongodb-sort-conference-2011
Log File Analysis: The most powerful tool in your SEO toolkit - Tom Bennet
Slide deck from Tom Bennet's presentation at Brighton SEO, September 2014. Accompanying guide can be found here: http://builtvisible.com/log-file-analysis/
Image Credits:
https://www.flickr.com/photos/nullvalue/4188517246
https://www.flickr.com/photos/small_realm/11189803763/
https://www.flickr.com/photos/florianric/7263382550
http://fotojenix.wordpress.com/2011/07/08/weekly-photo-challenge-old-fashioned/
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy - LITTINRAJAN
Web Scraping Introduction. It covers the most widely available libraries and how they can be used to scrape the required data. Created by Littin Rajan
View all the MongoDB World 2016 Poster Sessions slides in one place!
Table of Contents:
1: BigData DB Infrastructure for Modeling the Fly Brain
2: Taming the WiredTiger Cache
3: Sharding with MongoDB 3.2 Kick the tires and pop the hood!
4: Scaling Proactive Anomaly Detection
5: MongoTx: Transactions with Sharding and Queries
6: MongoDB: It’s Not Too Late To Shard
7: DLIFLC usage of MongoDB
IPFS is a distribution protocol that enables the creation of completely distributed applications through content addressing. A very ambitious open source project in Go, IPFS adopts a peer-to-peer hypermedia protocol to protect against a single point of failure. This presentation aims to highlight the design and ideas of IPFS and also touches upon a real world use case.
We went over what Big Data is and its value. This talk will cover the details of Elasticsearch, a Big Data solution. Elasticsearch is a NoSQL-backed search engine using an HDFS-based filesystem.
We'll cover:
• Elasticsearch basics
• Setting up a development environment
• Loading data
• Searching data using REST
• Searching data using NEST, the .NET interface
• Understanding Scores
Finally, I show a use-case for data mining using Elasticsearch.
You'll walk away from this armed with the knowledge to add Elasticsearch to your data analysis toolkit and your applications.
Following classical software architecture patterns, we tend to design applications as large monoliths.
These monoliths are typically difficult to scale because they often require powerful machines, which makes scaling out very expensive.
In most cases these monoliths are designed to run on a single machine only, so scaling out is complicated or even impossible without refactoring large portions of the application.
To address this, a design pattern called microservices emerged.
The microservices pattern is designed with a clustered server setup in mind and helps keep the application modular.
This simplifies scaling out your application and even lets you scale only its bottlenecks, reducing the total cost of scaling out.
In this talk I will introduce the concept of microservices, how they are defined, and how to design an application with them.
Furthermore, I will show how to scale the application properly and why this is only possible because of microservices.
We will also look at Node.js and why it is a perfect, though not the only, fit for this design strategy.
However, scaling is not the only purpose of microservices: they also increase the flexibility and maintainability of applications, which will also be discussed in the talk.
Amazon Web Services offers a quick and easy way to build a scalable search platform, a flexibility that is especially useful when an initial data load is required but the hardware is no longer needed for day-to-day searching and adding new documents. This presentation will cover one such approach capable of enlisting hundreds of worker nodes to ingest data, track their progress, and relinquish them back to the cloud when the job is done. The data set that will be discussed is the collection of published patent grants available through Google Patents. A single Solr instance can easily handle searching the roughly 1 million patents issued between 2005 and 2010, but up to 50 worker nodes were necessary to load that data in a reasonable amount of time. Also, the same basic approach was used to make three sizes of PNG thumbnails of the patent grant TIFF images. In that case 150 worker nodes were used to generate 1.6 TB of data over the course of three days. In this session, attendees will learn how to leverage EC2 as a scalable indexer and tricks for using XSLT on very large XML documents.
Intro to MongoDB
Get a jumpstart on MongoDB, use cases, and next steps for building your first app with Buzz Moschetti, MongoDB Enterprise Architect.
@BuzzMoschetti
Sharding allows you to distribute load across multiple servers and keep your data balanced across those servers. This session will review MongoDB’s sharding support, including an architectural overview, design principles, and automation.
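As a rough illustration of the idea of distributing load by a shard key, here is a minimal Python sketch of hash-based routing. It is not MongoDB's implementation: MongoDB assigns ranges of shard-key values (ranged or hashed) to shards and moves them with a balancer, whereas the `shard_for` and `distribute` helpers below are hypothetical names that simply hash a key and take it modulo the number of shards.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard-key value to a shard index using a stable hash.

    Illustration only; the function name and the modulo scheme are
    assumptions, not MongoDB's actual routing logic.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards: int):
    """Group keys by the shard index they hash to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    placement = distribute([f"user{i}" for i in range(1000)], 4)
    for shard, members in placement.items():
        print(shard, len(members))
```

A plain modulo scheme like this forces most keys to move when the number of shards changes, which is one reason real systems track ranges of key values per shard and rebalance them incrementally.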
Presented at Codemotion Warsaw 2016 and JDD 2016.
Pig, Hive, Flink, Kafka, Zeppelin... if you are now wondering whether someone just tried to offend you or whether those are just Pokemon names, then this talk is for you!
Big Data is everywhere and new tools for it are released almost at the speed of new JavaScript frameworks. During this entry-level presentation we will walk through the challenges that Big Data presents, reflect on how big is big, and introduce the currently most fancy and popular (mostly open source) tools.
We'll try to spark interest in Big Data by showing application areas and by throwing out ideas that you can dive into later.
Elasticsearch Distributed search & analytics on BigData made easy - Itamar
Elasticsearch is a cloud-ready, super scalable search engine which is gaining a lot of popularity lately. It is mostly known for being extremely easy to set up and integrate with any technology stack. In this talk we will introduce Elasticsearch and start by looking at some of its basic capabilities. We will demonstrate how it can be used for document search and even log analytics for DevOps and distributed debugging, and peek into more advanced usages like real-time aggregations and percolation. We will also demonstrate how Elasticsearch can be scaled out easily to work on a distributed architecture and handle pretty much any load.
Recent releases of the .NET driver have added lots of cool new features. In this webinar we will highlight some of the most important ones. We will begin by discussing serialization. We will describe how serialization is normally handled, and how you can customize the process when you need to, including some tips on migration strategies when your class definitions change. We will continue with a discussion of the new Query builder, which now includes support for typed queries. A major new feature of recent releases is support for LINQ queries. We will show you how the .NET driver supports LINQ and discuss what kinds of LINQ queries are supported. Finally, we will discuss what you need to do differently in your application when authentication is enabled at the server.
How many times have you wanted to find some information on a website, only to be disappointed with the filtering and discovery options available? Learn how to get data from a site and look for the data that you really care about.
All you need to know about XPath 1.0 in a web scraping project: the different axes, attribute matching, string functions, EXSLT extensions plus a few other handy patterns like CSS selectors and JavaScript parsing.
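To make the attribute-matching part concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. This module supports only a limited subset of XPath (no string functions, no EXSLT, few axes), so it illustrates attribute predicates only. The sample markup, the class names, and the `extract_titles` helper are made up for the example, and the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real-world HTML usually needs a
# more forgiving parser than ElementTree.
PAGE = (
    "<html><body>"
    "<div class='listing'><a class='title' href='/item/1'>First</a></div>"
    "<div class='ad'><a class='title' href='/ad/9'>Sponsored</a></div>"
    "<div class='listing'><a class='title' href='/item/2'>Second</a></div>"
    "</body></html>"
)


def extract_titles(root):
    """Return (text, href) pairs for title links inside listing divs."""
    # [@class='listing'] keeps only divs whose class attribute matches exactly.
    links = root.findall(".//div[@class='listing']/a[@class='title']")
    return [(a.text, a.get("href")) for a in links]


if __name__ == "__main__":
    for text, href in extract_titles(ET.fromstring(PAGE)):
        print(text, href)
```

The `div` with class `ad` is skipped because the predicate compares the attribute value exactly; matching on a substring of an attribute would need a full XPath 1.0 engine with functions such as `contains()`.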
AWS Big Data Demystified #1: Big data architecture lessons learned - Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
As Uber continues to grow, our big data systems need to grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. We have run Presto in production since 2016. Presto now serves ~100K queries per day at Uber and has become a key component for interactive SQL queries on big data. In this presentation, we talk about our experiences and engineering efforts. We start with a general introduction to Hadoop Infrastructure & Analytics @ Uber, followed by a brief introduction to Presto, the interactive SQL engine for big data. We then focus on how we built the new Parquet reader for Presto and on its techniques in detail: columnar reads, lazy reads, and nested column pruning. We will show performance improvements and Uber's use cases. Finally, we share our ongoing plans and future work for Big Data Analytics @ Uber.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
Introduction to Big Data Technologies & Applications - Nguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
A Day in the Life of a Druid Implementor and Druid's Roadmap - Itai Yaffe
Benjamin Hopp (Solutions Architect) @ Imply:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets.
This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit.
Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics.
Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.
The most important contributor to a fast analytical setup is getting the data model right.
The talk will center on the choices you can make when preparing your data to get the best possible query performance.
We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes.
We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.
We’ll also look at Druid-specific optimizations that take advantage of approximations, where you can trade accuracy for performance and reduced storage.
You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more.
And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.
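The roll-up idea mentioned in this abstract, pre-aggregating rows that share a time bucket and the same dimension values before they are stored, can be sketched in a few lines of Python. This is a conceptual illustration only, not Druid's ingestion code: the event fields (`timestamp`, `country`, `bytes`), the hourly granularity, and the `roll_up` function name are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, bucket_format="%Y-%m-%dT%H:00"):
    """Aggregate raw events into one row per (time bucket, country).

    Hypothetical helper: truncates each timestamp to the hour and sums
    the metrics of events that share the same bucket and dimension value.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        bucket = datetime.fromisoformat(event["timestamp"]).strftime(bucket_format)
        row = totals[(bucket, event["country"])]
        row["count"] += 1
        row["bytes"] += event["bytes"]
    return dict(totals)


if __name__ == "__main__":
    raw = [
        {"timestamp": "2024-01-01T10:05:00", "country": "US", "bytes": 100},
        {"timestamp": "2024-01-01T10:45:00", "country": "US", "bytes": 200},
        {"timestamp": "2024-01-01T11:10:00", "country": "US", "bytes": 50},
    ]
    for key, row in roll_up(raw).items():
        print(key, row)
```

Here three raw events collapse into two stored rows. Queries at hourly granularity see the same totals, but the individual events can no longer be recovered, which is the trade-off that comes with rolling up.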
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401... - Amazon Web Services
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you'll learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
RubiX: A caching framework for big data engines in the cloud. It provides data caching capabilities to engines like Presto, Spark, Hadoop, etc., transparently and without user intervention.
44CON 2014: Using hadoop for malware, network, forensics and log analysis - Michael Boman
More than a hundred thousand new malware samples appear every day, network speeds are measured in multiples of ten gigabits per second, computer systems have terabytes of storage, and the log files are just piling up. By using Hadoop you can tackle these problems in a whole different way, and “Too Much Data to Process” will be a thing of the past.
Architecting Big Data Ingest & Manipulation - George Long
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session:
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about:
• What does it actually take to get the data you need to add that one metric to your report/dashboard?
• What's it like to navigate the early conversations of an analytic solution?
• How is one technology selected over another, and how do those selections impact or define other selections?
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating the uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are the slides of the talk given at the IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... - Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
4. Founded in 2010, largest 100% remote company based outside of the US
We’re 126 teammates in 41 countries
5. About Scrapinghub
Scrapinghub specializes in data extraction. Our platform is
used to scrape over 4 billion web pages a month.
We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle free
● A cloud-based platform that makes scraping a breeze
6. Who Uses Web Scraping
Used by everyone from individuals to
multinational companies:
● Monitor your competitors’ prices by scraping
product information
● Detect fraudulent reviews and sentiment
changes by scraping product reviews
● Track online reputation by scraping social
media profiles
● Create apps that use public data
● Track SEO by scraping search engine results
7. “Getting information off the
Internet is like taking a drink
from a fire hydrant.”
– Mitchell Kapor
8. Scrapy
Scrapy is a web scraping framework that
gets the dirty work related to web crawling
out of your way.
Benefits
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
9. Introducing Portia
Portia is a Visual Scraping tool that lets you
get data without needing to write code.
Benefits
● No platform lock-in: Open Source
● Handles dynamic, JavaScript-generated content
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page
10. How Portia Works
User provides seed URLs:
Follows links
● Users specify which links to follow (regexp, point-and-click)
● Automatically guesses: finds and follows pagination, infinite scroll, prioritizes content
● Knows when to stop
Extracts data
● Given a sample, extracts the same data from all similar pages
● Understands repetitive patterns
● Manages item schemas
Run standalone or on Scrapy Cloud
12. Large Scale Infrastructure
Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on our cloud infrastructure
● Crawlera add-on
● Control your spiders: Command line, API or web UI
● Machine learning integration: BigML, MonkeyLearn, among others
● No lock-in: scrapyd, Scrapy or Portia to run spiders on your own
infrastructure
13. Data Growth
● Items, logs and requests are collected in real time
● Millions of web crawling jobs each month
● Now at 4 billion pages a month and growing
● Thousands of separate active projects
14. Data Dashboard
● Browse data as the crawl is running
● Filter and download huge datasets
● Items can have arbitrary schemas
15. MongoDB - v1.0
MongoDB was a good fit to get a demo up and
running, but it’s a bad fit for our use at scale
● Cannot keep hot data in memory
● Lock contention
● Cannot order data without sorting, skip+limit
queries slow
● Poor space efficiency
See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
16. Storage Requirements - v2.0
● High write volume. Writes are micro-batched
● Much of the data is written in order and immutable (like logs)
● Items are semi-structured nested data
● Expect exponential growth
● Random access from dashboard users, keep summary stats
● Sequential reading important (downloading & analyzing)
● Store data on disk, many TB per node
17. Bigtable looks good...
Google’s Bigtable provides a sparse,
distributed, persistent
multidimensional sorted map
We can express our requirements in terms of what
Bigtable provides
Performance characteristics should
match our workload
Inspired several open source projects
18. Apache HBase
● Modelled after Google’s Bigtable
● Provides real-time random read and write access to billions of rows with
millions of columns
● Runs on Hadoop and uses HDFS
● Strictly consistent reads and writes
● Extensible via server side filters and coprocessors
● Java-based
20. HBase Key Selection
Key selection is critical
● Atomic operations are at the row level: we use fat columns, update counts on write
operations and delete whole rows at once
● Order is determined by the binary key: our offsets preserve order
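A minimal sketch of why fixed-width big-endian offsets keep rows in order when keys are compared byte by byte. The key layout (project, spider, job, item offset) and the function name are assumptions made for illustration; the slides do not spell out the actual key format.

```python
import struct

# Hypothetical layout: three 32-bit IDs followed by a 64-bit item offset.
KEY_FORMAT = ">IIIQ"


def make_row_key(project_id: int, spider_id: int, job_id: int, offset: int) -> bytes:
    """Pack the IDs as fixed-width big-endian unsigned integers."""
    return struct.pack(KEY_FORMAT, project_id, spider_id, job_id, offset)


keys = [make_row_key(1, 2, 3, n) for n in (10, 2, 300, 1)]

# A byte-wise sort (the way a binary-keyed store orders rows) matches numeric order.
ordered_offsets = [struct.unpack(KEY_FORMAT, k)[3] for k in sorted(keys)]

# Decimal strings, by contrast, do not sort numerically.
text_sorted = sorted(str(n).encode() for n in (10, 2, 300, 1))
```

With this encoding a scan over one job's key prefix returns items in the order they were written, which is what sequential downloads need.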
21. HBase Values
We store the entire item record as msgpack-encoded data in a single value
● Msgpack is like JSON but fast and small
● Storing entire records as a value has low
overhead (vs. splitting records into multiple
key/values in HBase)
● HBase doesn’t handle very large values well, which requires
us to limit the size of single records
● We need arbitrarily nested data anyway, so we
need some custom binary encoding
● Write custom Filters to support simple queries
22. HBase Deployment
● All access is via a single service that provides a restricted API
● Ensure no long running queries, deal with timeouts everywhere, ...
● Tune settings to work with a lot of data per node
● Set block size and compression for each Column Family
● Do not use block cache for large scans (Scan.setCacheBlocks) and
‘batch’ every time you touch fat columns
● Scripts to manage regions (balancing, merging, bulk delete)
● We host in Hetzner, on dedicated servers
● Data replicated to backup clusters, where we run analytics
23. HBase Lessons Learned
● It was a lot of work
○ API is low level (untyped bytes) - check out Apache Phoenix
○ Many parts -> longer learning curve and difficult to debug. Tools
are getting better
● Many of our early problems were addressed in later releases
○ reduced memory allocation & GC times
○ improved MTTR
○ online region merging
○ scanner heartbeat
25. Broad Crawls
Frontera allows us to build large scale web crawlers in Python:
● Scrapy support out of the box
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
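To make the link-analysis idea concrete, here is a small PageRank power iteration in plain Python. It is a sketch of the general technique only, not Aduana's implementation; the toy graph, damping factor and iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a score per page; `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
crawl_order = sorted(scores, key=scores.get, reverse=True)
```

A frontier could then schedule URLs in `crawl_order`, so that well-linked pages are fetched before pages nobody links to.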
26. Broad Crawls
Many uses of Frontera:
○ News analysis, Topical crawling
○ Plagiarism detection
○ Sentiment analysis (popularity, likeability)
○ Due diligence (profile/business data)
○ Lead generation (extracting contact information)
○ Track criminal activity & find lost persons (DARPA)
28. Frontera Architecture
Supports both local and distributed mode
● Scrapy for crawl spiders
● Kafka for message bus
● HBase for storage and frontier
maintenance
● Twisted.Internet for async primitives
● Snappy for compression
29. Frontera: Big and Small hosts
Ordering of URLs across hosts is important:
● Politeness: a single host crawled by one Scrapy process
● Each Scrapy process crawls multiple hosts
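One way to get "a single host crawled by one Scrapy process" is to pick the process from a stable hash of the hostname. The sketch below assumes that approach; the function name and the use of MD5 are illustrative choices, not a description of Frontera's partitioner.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for(url: str, num_spiders: int) -> int:
    """Map a URL to a spider partition using a stable hash of its host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_spiders


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "https://example.org/",
]
assignments = {u: partition_for(u, 4) for u in urls}
```

Every URL on the same host lands in the same partition, so per-host delays can be enforced locally, while each partition still serves many hosts.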
Challenges we found at scale:
Queue flooded with URLs from the same host.
○ Underuse of spider resources.
Additional per-host (per-IP) queue and metering algorithm.
URLs from big hosts are cached in memory.
○ Found a few very huge hosts (>20M docs)
All queue partitions were flooded with huge hosts.
Two MapReduce jobs: queue shuffling, limit all hosts to 100
docs MAX.
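The two fixes above (a per-host cap and shuffling so one host cannot flood the queue) can be sketched in memory as follows. The slides describe MapReduce jobs over the real queue; this single-process version, its function name and the small cap used in the example are assumptions for illustration.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def build_batch(urls, max_per_host=100):
    """Cap URLs per host, then interleave the hosts round-robin."""
    per_host = OrderedDict()
    for url in urls:
        host = urlsplit(url).hostname or ""
        queue = per_host.setdefault(host, deque())
        if len(queue) < max_per_host:
            queue.append(url)
    batch = []
    while per_host:
        for host in list(per_host):
            batch.append(per_host[host].popleft())
            if not per_host[host]:
                del per_host[host]
    return batch


incoming = [f"http://big.example/{i}" for i in range(5)] + ["http://small.example/x"]
batch = build_batch(incoming, max_per_host=3)
```

Here the big host contributes only three URLs and the small host is served in the first round instead of waiting behind it.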
30. Frontera: tuning
Breadth-first strategy: huge number of DNS requests
● Recursive DNS server on every spider node, upstream to
Verizon & OpenDNS
● Scrapy patch for large thread pool for DNS resolving and
timeout customization
Intensive network traffic from workers to services
● Throughput between workers and Kafka/HBase ~ 1Gbit/s
● Thrift compact protocol for HBase
● Message compression in Kafka with Snappy
Batching and caching to achieve performance
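Batching, as mentioned for the traffic between workers and services, can be illustrated with a small buffer that forwards records in groups. This is a generic sketch; the class name, the batch size and the `sink` callable are hypothetical and do not reflect Frontera's internals.

```python
class BatchingWriter:
    """Buffer records and hand them to `sink` in fixed-size batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, for example at shutdown or on a timer."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


sent = []  # stands in for a network call that accepts a list of records
writer = BatchingWriter(sent.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()
```

Seven writes turn into three calls to the sink, which is the effect that reduces per-request overhead on the wire.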
31. Duplicate Content
The web is full of duplicate content.
Duplicate Content negatively impacts:
● Storage
● Re-crawl performance
● Quality of data
Efficient algorithms for Near Duplicate Detection, like SimHash, are
applied to estimate similarity between web pages to avoid scraping
duplicated content.
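A compact SimHash sketch, assuming word tokens with equal weights and MD5 as the per-token hash. Production systems typically use shingles and feature weights; the tokenizer and constants here are illustrative assumptions.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8 hash bytes used below


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint from lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


page_a = simhash("Lenovo ThinkPad X220 laptop, i7 2.8GHz, 320 GB")
page_b = simhash("lenovo thinkpad x220 LAPTOP i7 2.8ghz 320 gb")
distance = hamming_distance(page_a, page_b)
```

Pages whose fingerprints differ in only a few bits are treated as near duplicates; the cut-off is a tuning choice that depends on the dataset.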
32. Near Duplicate Detection Uses
Compare prices of products scraped from different retailers by finding
near duplicates in a dataset:
Title | Store | Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95
Merge similar items to avoid duplicate entries:
Name | Summary | Location
Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
33. What we’re seeing...
● More data is available than ever
● Scrapinghub can provide web data in a usable format
● We’re combining multiple data sources and analyzing them
● The technology to use big data is rapidly improving and
becoming more accessible
● Data Science is everywhere
8 years ago I started scraping in anger. I saw quite a few examples of what not to do, which is one reason I started to write a framework.
That framework was later open sourced as Scrapy. I also worked on a visual scraper that turned into Portia, and on the design for Frontera. If you’ve never heard of these, don’t worry, we’ll get to them in a while.
Co-Founded Scrapinghub with Pablo Hoffman
Work with lots of amazing spidermen and spiderwomen - so I’m around web scraping all the time
3 billion pages a month: around 1200 pages per second
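The per-second figure follows from simple arithmetic, shown here assuming a 30-day month:

```python
PAGES_PER_MONTH = 3_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: 30-day month

pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
```

That works out to a little under 1,160 pages per second, which rounds to the "around 1200" quoted above.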
Nice things about Scrapy: Async networking. Deals with retrying, redirection, duplicated requests, noscript traps, robots.txt, cookies, logins, throttling, JS (Splash), community plugins, Scrapy Cloud or scrapyd to deploy, tools that make Scrapy even better: Crawlera, Frontera, Splash.
Nice things about Portia: open source, uses Splash to render JavaScript, addons, scraping for non-devs, speeds up the work for devs, data journalists can use it
Clients are Java-based; there is a Thrift gateway for non-Java clients.
Multiple region servers (like data storage nodes).
Each region holds a range of data and HBase maintains its start and end key internally. Once a region grows beyond a certain size, it is split in two.
Many regions per region server.
A directory of what regions are allocated where is kept in a META table, whose location is stored in ZooKeeper.
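The region idea (sorted key ranges, a directory of who serves what, splits as regions grow) can be modelled with a sorted list of start keys. This is a toy model, not HBase's META table; the class, method and server names are hypothetical.

```python
import bisect


class RegionDirectory:
    """Toy directory mapping sorted region start keys to region servers."""

    def __init__(self):
        self.start_keys = [b""]  # the first region starts at the empty key
        self.servers = ["rs1"]

    def locate(self, row_key: bytes) -> str:
        """Return the server whose region range contains row_key."""
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.servers[index]

    def split(self, split_key: bytes, new_server: str) -> None:
        """Split the region containing split_key; the upper half goes to new_server."""
        index = bisect.bisect_right(self.start_keys, split_key)
        self.start_keys.insert(index, split_key)
        self.servers.insert(index, new_server)


meta = RegionDirectory()
meta.split(b"m", "rs2")
```

After the split, keys below `b"m"` still resolve to the original server and keys from `b"m"` upward resolve to the new one, so lookups stay a binary search over start keys.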
Data is aggregated in memory (in the memstore) and written to the WAL.
The memstore is periodically flushed.
HFiles are merged together during compaction.
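The write path in these notes (WAL, memstore, flush, compaction) can be mimicked with a few Python containers. This is a teaching sketch under simplifying assumptions: a tiny flush threshold, no deletes, and in-memory lists standing in for files on HDFS.

```python
class ToyStore:
    """Toy write path: WAL append, in-memory memstore, flushed sorted files."""

    def __init__(self, flush_threshold=2):
        self.wal = []       # append-only log, kept for recovery
        self.memstore = {}  # recent writes held in memory
        self.hfiles = []    # each "file" is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def compact(self):
        """Merge all files into one; newer files overwrite older values."""
        merged = {}
        for hfile in self.hfiles:  # oldest first
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
files_before = len(store.hfiles)
store.compact()
```

Two flushes produce two files holding different versions of `"a"`; compaction leaves one file with only the newest value, which is why reads get cheaper afterwards.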