This document summarizes a presentation given by Alexander Sibiryakov about Frontera, an open source web crawling framework. Frontera allows building large-scale web crawlers that can crawl billions of pages per month in a distributed manner. It provides abstractions for crawling strategies, message buses, and backend storage. The document describes example uses of Frontera including focused crawls, news analysis, and due diligence. It also outlines the software and hardware requirements and discusses future plans for Frontera.
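The core idea behind a crawl frontier can be sketched in a few lines: a queue of URLs to fetch plus a seen-set for de-duplication. The sketch below is illustrative only and does not use Frontera's actual API, which layers crawl strategies, message buses, and pluggable backends on top of this basic idea.

```python
from collections import deque

class Frontier:
    """Toy crawl frontier: FIFO queue of URLs with de-duplication.

    Illustrative only; names and structure are invented, not Frontera's API.
    """
    def __init__(self, seeds):
        self.seen = set()
        self.queue = deque()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Only enqueue URLs we have never seen before.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        # Hand the fetcher the next URL, or None when the frontier is drained.
        return self.queue.popleft() if self.queue else None

frontier = Frontier(["https://example.com/"])
frontier.add("https://example.com/a")
frontier.add("https://example.com/")  # duplicate, silently ignored
```

A distributed crawler like Frontera partitions this structure across workers and persists it in a backend, but the fetch/extract/enqueue loop is the same.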
Weaving the ILP Fabric into BigchainDB (Interledger)
Dimitri De Jonghe presents on how BigchainDB can use Interledger to connect disparate systems. Presented at the Interledger Workshop in London on 7/6/2016. Full presentation here: https://interledger.org/presentations/2016-07-06%20-%20ILP%20Workshop%20London%202016.pdf
The new decentralized compute stack and its application (BigchainDB)
Dimitri De Jonghe of BigchainDB talks about the new decentralized compute stack, which helps to understand how your blockchain application or use case fits.
Examples of current applications and uses are also given.
Please contact BigchainDB to put your blockchain idea into practice today.
Blockchain Beyond Finance - Cronos Groep - Jan 17, 2017 (BigchainDB)
Towards the internet of value & trust.
"To develop shared global compute infrastructure,
we must first understand the status quo of infrastructure,
...and how to change it accordingly."
Dimitri De Jonghe, lead developer of BigchainDB, talks about blockchain technology beyond the financial sector.
Why Blockchain Matters to Big Data - Big Data London Meetup - Nov 3, 2016 (BigchainDB)
Why does blockchain matter to Big Data?
Bruce Pon, CEO and Co-Founder of BigchainDB, talks about how blockchain and big data work together.
Follow BigchainDB on LinkedIn, download the whitepaper, or sign up at the IPDB Foundation to get access to a first test network built with BigchainDB and build your own blockchain application.
BigchainDB: Blockchains for Artificial Intelligence by Trent McConaghy (BigchainDB)
How can blockchains help AI?
-Decentralized model exchange
-Model audit trail
-AI DAOs
-more
A blockchain caveat or two
Completely new code bases
Reinventing consensus
No sharding = no scaling
No querying // single-node querying
Let’s fix this...
Blockchain Satellites - The Future of Space Commerce (Hasshi Sudler)
Presentation made on 10/26/2020 outlining the launch of the first private blockchain into space on the Firefly Aerospace rocket planned for late December 2020. The presentation is delivered by Hasshi Sudler and Alejandro Gomez of Villanova University and Elizabeth Kennick and Joe Latrell of Teachers In Space.
This presentation covers consensus fundamentals, the consensus algorithms used in Hyperledger blockchain projects today, and how they work. It was presented at the April 2nd SF Hyperledger Meetup @ PubNub.
Indexing Decentralized Data with Ethereum, IPFS & The Graph (Stefan Adolf)
The deck for a lightning talk I gave at Coding Berlin November 2019. Demonstrates how you can index data from a decentralized ledger (Ethereum) and filesystem (IPFS) using "The Graph" nodes and query them by using a pure GraphQL API.
Rather than trying to scale up blockchain technology, BigchainDB starts with a big data distributed database and then adds blockchain characteristics - decentralized control, immutability and the transfer of digital assets.
Eagle6 is a product that uses system artifacts to create a replica model representing a near-real-time view of system architecture. Eagle6 was built to collect system data (log files, application source code, etc.) and to link system behaviors in such a way that the user can quickly identify risks associated with unknown or unwanted behavioral events that may have unknown impacts on seemingly unrelated downstream systems. This session presents the capabilities of the Eagle6 modeling product and how we are using MongoDB to support near-real-time analysis of large, disparate datasets.
CPaaS.io Y1 Review Meeting - Holistic Data Management (Stephan Haller)
Data management and governance aspects of the CPaaS.io platform as presented at the first year review meeting in Tokyo on October 5, 2017.
Disclaimer:
This document has been produced in the context of the CPaaS.io project which is jointly funded by the European Commission (grant agreement n° 723076) and NICT from Japan (management number 18302). All information provided in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. For the avoidance of all doubts, the European Commission and NICT have no liability in respect of this document, which is merely representing the view of the project consortium. This document is subject to change without notice.
An online training course run by the FIWARE Foundation in conjunction with the i4Trust project. The core part of this virtual training camp (21-24 June 2021) covered all the necessary skills to develop smart solutions powered by FIWARE. It introduces the basis of Digital Twin programming using linked data concepts - JSON-LD and NGSI-LD and combines these with common smart data models for the sharing and augmentation of context data.
In addition, it covers the supplementary FIWARE technologies used to implement the common functions typically required when architecting a complete smart solution: Identity and Access Management (IAM) functions to secure access to digital twin data and functions enabling the interface with IoT and 3rd systems, or the connection with different tools for processing and monitoring current and historical big data.
This 12-hour online training course can be used to obtain a good understanding of FIWARE and NGSI Interfaces and form the basis of studying for the FIWARE expert certification.
Extending this core part, the virtual training camp adds introductory and deep-dive sessions on how FIWARE and iSHARE technologies, brought together under the umbrella of the i4Trust initiative, can be combined to provide the means for the creation of data spaces in which multiple organizations can exchange digital twin data in a trusted and efficient manner, collaborating in the creation of innovative services based on data sharing. In addition, SMEs and Digital Innovation Hubs (DIHs) that go through this complete training and are located in countries eligible under Horizon 2020 will be equipped with the necessary know-how to apply to the recently launched i4Trust Open Call.
What is RecordsKeeper?
RecordsKeeper is an open-source, public, mineable blockchain for record keeping and data security. It allows anyone to publish up to 8 MB of data in key-value format, paying fees in XRK coins as part of the transaction, and to retrieve the data at any time in the future for free using the record key or transaction id. Data and records uploaded to the RecordsKeeper platform are immutable and verifiable without any trusted third party.
Powered by strong encryption and blockchain technology, the RecordsKeeper public blockchain allows anyone to create verifiable and immutable records, something not possible in traditional technologies like MySQL, Oracle, or MSSQL. It can also be seen as a tool to generate proof of existence, proof of authenticity, and proof of integrity for a file, record, JSON/XML object, document, certificate, or degree on the blockchain.
RecordsKeeper offers a full suite of structured, easily accessible record keeping for organizations and individuals. It creates a platform for structured storage over the decentralized network for ease of data access and security between peers, capitalizing on the strengths of the blockchain network to create an ecosystem for secure transfer, authorization, integrity, and authenticity of data.
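The proof-of-existence idea described above boils down to anchoring a cryptographic digest: publish the hash of a document, and anyone holding the original can later re-derive and compare it. A minimal sketch follows; this is not RecordsKeeper's actual API, and `record_digest` and `verify` are invented names for illustration.

```python
import hashlib

def record_digest(data: bytes) -> str:
    # The digest is what would be anchored on-chain as proof of existence.
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, anchored_digest: str) -> bool:
    # Anyone with the original bytes can re-derive the digest and compare.
    return record_digest(data) == anchored_digest

digest = record_digest(b"my certificate")
```

The blockchain's role is to make the anchored digest immutable and timestamped; the hash itself requires no trusted third party to check.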
This session will focus on how to integrate the voices of youth and families into your work in a meaningful, productive way that can improve your outcomes and service delivery. The first part of the session will include presentations on current efforts to engage youth and families in various fields in Ohio, including youth facing mental health challenges and who are involved in the juvenile justice and foster care systems. The second part of the session will involve small group brainstorming about concrete action steps you can take back to your organization to begin or continue youth and family engagement.
Technical communication tailored to digital natives in Industry 4.0 (KEA s.r.l.)
Visual, interactive, multimedia, and social: this is what Industry 4.0 technical communication aimed at digital natives will look like, with monitoring and profiling playing a fundamental role in the continuous improvement and personalization of information, products, and services.
Ideas drawn from the excellent book edited by Annalisa Magone and Tatiana Mazali, Industria 4.0. Uomini e macchine nella fabbrica digitale, Guerini e Associati, Milan, 2016.
11 Stats You Didn’t Know About Employee Recognition (Officevibe)
Recognizing employees is one of the most overlooked facets of management that even great leaders sometimes forget about. Without a good employee recognition strategy, people will feel unappreciated and build up stress.
In fact, the number one reason most Americans leave their jobs is that they don’t feel appreciated. The last thing you want is high employee turnover because of poor employee recognition.
Officevibe put together some incredible statistics about employee recognition.
Read more on Officevibe blog:
https://www.officevibe.com/blog/employee-recognition-infographic
Learn more about Officevibe, the simplest tool for a greater workplace:
https://www.officevibe.com/
Follow us on Facebook:
https://www.facebook.com/officevibe
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.
Frontera: a distributed robot for crawling the web at large scale / Alexander S... (Ontico)
In this talk I will share our experience crawling the Spanish web. We set ourselves the task of crawling about 600,000 websites in the .es zone in order to gather statistics about the hosts and their sizes. I will cover the crawler's architecture, the storage, the problems we ran into during the crawl, and how we solved them.
Our solution is available as the open source framework Frontera. The framework lets you build a distributed robot for downloading pages from the web in large volumes in real time. It can also be used to build focused crawlers for fetching a subset of websites known in advance.
The framework offers: configurable storage of URL documents (RDBMS or key-value), crawl strategy management, a transport layer abstraction, and a fetcher module abstraction.
The talk follows an engaging format: a description of the problem, the solution, and the issues that arose while developing that solution.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes (MongoDB)
With so much talk of how Big Data is revolutionizing the world and how a data lake with Hadoop and/or Spark will solve all your data problems, it is hard to tell what is hype, reality, or somewhere in-between.
In working with dozens of enterprises in varying stages of their enterprise data management (EDM) strategy, MongoDB enterprise architect, Matt Kalan, sees the same challenges and misunderstandings arise again and again.
In this session, he will explain common challenges in data management, what capabilities are necessary, and what the future state of architecture looks like. MongoDB is uniquely capable of filling common gaps in the data lake strategy.
This session also includes a live Q&A portion during which you are encouraged to ask questions of our team.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open, and accessible Web-Scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
High cardinality time series search: A new level of scale - Data Day Texas 2016 (Eric Sammer)
Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data.
We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases.
This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.
Two popular tools for doing machine learning on top of the JVM ecosystem are H2O and SparkML. This presentation compares the two as machine learning libraries (it does not consider Spark's data munging capabilities). This work was done in June 2018.
Stream Processing with Apache Kafka and .NETconfluent
Presentation from South Bay.NET meetup on 3/30.
Speaker: Matt Howlett, Software Engineer at Confluent
Apache Kafka is a scalable streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix, Walmart, Airbnb, Goldman Sachs and LinkedIn. In this talk Matt will give a technical overview of Kafka, discuss some typical use cases (from surge pricing to fraud detection to web analytics) and show you how to use Kafka from within your C#/.NET applications.
Hadoop Administrator online training course by Knowledgebee Trainings, covering Hadoop cluster planning & deployment, monitoring, performance tuning, security using Kerberos, HDFS high availability using Quorum Journal Manager (QJM), Oozie, and HCatalog/Hive administration.
Contact : knowledgebee@beenovo.com
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P... (PyData)
At this workshop, you will build your own messaging insights system - data ingestion from a live data source (Reddit), queueing, deploying a machine learning model, and serving messages with insights to your mobile phone!
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh (PyData)
In the same way that we need to make assertions about how code functions, we need to make assertions about data, and unit testing is a promising framework. In this talk, we'll explore what is unique about unit testing data, and see how Two Sigma's open source library Marbles addresses these unique challenges in several real-world scenarios.
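The core idea, assertions about data rather than code, can be sketched without any library. The sketch below is illustrative and does not use Marbles' actual API; `check_readings` and the row schema are invented for this example. Marbles wraps the same idea in unittest with much richer failure messages.

```python
def check_readings(rows):
    """Unit-test-style assertions about data quality, collected as failures."""
    failures = []
    for i, row in enumerate(rows):
        # Range check: temperatures outside a plausible band are suspect.
        if not (0.0 <= row["temp_c"] <= 60.0):
            failures.append((i, "temp_c out of range"))
        # Completeness check: every reading must identify its sensor.
        if row["sensor_id"] is None:
            failures.append((i, "missing sensor_id"))
    return failures

rows = [
    {"sensor_id": "a1", "temp_c": 21.5},
    {"sensor_id": None, "temp_c": 19.0},
    {"sensor_id": "a2", "temp_c": 99.0},
]
failures = check_readings(rows)
```

What makes data testing distinct, and what libraries like Marbles add, is context in the failure report: which row, which column, and why the expectation exists.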
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski (PyData)
TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.
Using Embeddings to Understand the Variance and Evolution of Data Science... ... (PyData)
In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.
Deploying Data Science for Distribution of The New York Times - Anne Bauer (PyData)
How many newspapers should be distributed to each store for sale every day? The data science group at The New York Times addresses this optimization problem using custom time series modeling and analytical solutions, while also incorporating qualitative business concerns. I'll describe our modeling and data engineering approaches, written in Python and hosted on Google Cloud Platform.
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma (PyData)
However, graph theory jargon can make graph analytics seem more intimidating for self-study than is necessary. In this talk, the audience will be exposed to some of the basic concepts of graph theory (no prerequisite math knowledge needed!) and a few of the Python tools available for graph analysis.
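As a taste of how approachable the basics are, one of the first graph-theory concepts, node degree, can be computed from a plain edge list in a few lines of Python with no graph library at all (the edge list here is invented for illustration):

```python
def degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# A tiny 4-node graph: c is the most connected node.
deg = degrees([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
```

Libraries such as NetworkX provide the same concept (and far more) behind friendlier names, which is exactly the bridge this kind of talk builds.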
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ... (PyData)
To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.
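Two common strategies for the problem the abstract raises can be sketched directly: pin the random seed so a nondeterministic function becomes reproducible, or assert statistical properties instead of exact values. The function and tolerances below are invented for illustration, not taken from the talk.

```python
import random

def noisy_mean(values, rng):
    """A nondeterministic function: the mean plus Gaussian jitter."""
    return sum(values) / len(values) + rng.gauss(0, 0.1)

# Strategy 1: pin the seed so two runs produce identical output.
a = noisy_mean([1.0, 2.0, 3.0], random.Random(42))
b = noisy_mean([1.0, 2.0, 3.0], random.Random(42))

# Strategy 2: assert a statistical property rather than an exact value.
# Across many seeds, the jitter should average out near zero.
samples = [noisy_mean([1.0, 2.0, 3.0], random.Random(i)) for i in range(200)]
avg = sum(samples) / len(samples)
```

In a pytest suite, strategy 1 becomes a fixture that injects a seeded RNG, and strategy 2 becomes an assertion with an explicit tolerance.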
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro (PyData)
Those of us who use TensorFlow often focus on building the model that's most predictive, not the one that's most deployable. So how to put that hard work to work? In this talk, we'll walk through a strategy for taking your machine learning models from Jupyter Notebook into production and beyond.
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod... (PyData)
In September 2017, dockless bikeshare joined the transportation options in the District of Columbia. In March 2018, scooter share followed. During the pilot of these technologies, Python has helped District Department of Transportation answer some critical questions. This talk will discuss how Python was used to answer research questions and how it supported the evaluation of this demonstration.
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott (PyData)
There are many stories of developers creating databases that don't operate at scale. The application is good, but the database won't work with realistic volumes of data. It's like a horror movie: they never looked behind the door, ran into the dark forest at night, and discovered the database was the monster killing their application. How can we leverage Python to avoid scaling problems?
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
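The three metrics named above are easy to compare directly. In the sketch below (pure Python; the vectors are invented for illustration), two term-count vectors point in the same direction but differ in length: Euclidean and Manhattan distance see them as far apart, while cosine distance, which ignores magnitude, sees them as identical. That length-invariance is one reason cosine is often preferred for text.

```python
from math import sqrt

def euclidean(a, b):
    # Straight-line distance between the two points.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of per-coordinate absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors; ignores length.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Two "documents" as term-count vectors: same direction, different length
# (e.g., the same text repeated twice).
a = [1, 2, 0]
b = [2, 4, 0]
```

Euclidean and Manhattan report a nonzero gap between `a` and `b`, but cosine distance is zero: by that metric the documents are about the same thing.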
End-to-End Machine learning pipelines for Python driven organizations - Nick ... (PyData)
The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.
We will be using Beautiful Soup to Webscrape the IMDB website and create a function that will allow you to create a dictionary object on specific metadata of the IMDB profile for any IMDB ID you pass through as an argument.
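A network-free sketch of the same idea, using only the standard library's `html.parser` in place of Beautiful Soup and an inline snippet in place of a live IMDB page (the tag choice and the `scrape_metadata` helper are invented for illustration):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <h1>, a stand-in for a bs4 find() call."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Only accumulate text while we are inside the <h1> element.
        if self.in_h1:
            self.title += data

def scrape_metadata(html):
    """Return a dictionary of metadata extracted from a page's HTML."""
    p = TitleParser()
    p.feed(html)
    return {"title": p.title.strip()}

meta = scrape_metadata("<html><body><h1>The Matrix (1999)</h1></body></html>")
```

With Beautiful Soup the parsing classes collapse to one-liners like `soup.find("h1").get_text()`, and the HTML would come from an HTTP request for the given IMDB ID.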
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef... (PyData)
This talk describes an experimental approach to time series modeling using 1D convolution filter layers in a neural network architecture. This approach was developed at System1 for forecasting marketplace value of online advertising categories.
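The building block of that architecture, a "valid" 1D convolution over a series, can be written in a few lines. As in most neural-network layers, it is technically cross-correlation; the series and kernel below are invented for illustration.

```python
def conv1d(signal, kernel):
    """'Valid' 1D convolution: slide the kernel over the signal,
    producing len(signal) - len(kernel) + 1 outputs."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]

# A moving-difference kernel highlights changes in the series.
series = [1.0, 1.0, 2.0, 4.0, 4.0]
edges = conv1d(series, [-1.0, 1.0])
```

A convolutional layer stacks many such kernels, learns their weights by gradient descent, and interleaves them with nonlinearities, but each filter computes exactly this sliding dot product.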
Extending Pandas with Custom Types - Will Ayd (PyData)
Pandas v.0.23 brought to life a new extension interface through which you can extend NumPy's type system. This talk will explain what that means in more detail and provide practical examples of how the new interface can be leveraged to drastically improve your reporting.
Machine learning models are increasingly used to make decisions that affect people’s lives. With this power comes a responsibility to ensure that model predictions are fair. In this talk I’ll introduce several common model fairness metrics, discuss their tradeoffs, and finally demonstrate their use with a case study analyzing anonymized data from one of Civis Analytics’s client engagements.
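One of the simplest such metrics, the demographic parity gap, compares positive-prediction rates across groups. The sketch below uses invented data; the function name and example values are illustrative, not from the case study.

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between any two groups.
    Zero means every group receives positive predictions at the same rate."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Group "a" gets positive predictions 50% of the time, group "b" only 25%.
gap = demographic_parity_gap(
    preds=[1, 1, 0, 0, 1, 0, 0, 0],
    groups=["a", "a", "a", "a", "b", "b", "b", "b"],
)
```

The tradeoff the talk highlights is real: enforcing demographic parity can conflict with other fairness notions such as equalized odds, so the right metric depends on the application.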
What's the Science in Data Science? - Skipper Seabold (PyData)
The gold standard for validating any scientific assumption is to run an experiment. Data science isn’t any different. Unfortunately, it’s not always possible to design the perfect experiment. In this talk, we’ll take a realistic look at measurement using tools from the social sciences to conduct quasi-experiments with observational data.
Applying Statistical Modeling and Machine Learning to Perform Time-Series For... (PyData)
Forecasting time-series data has applications in many fields, including finance, health, etc. There are potential pitfalls when applying classic statistical and machine learning methods to time-series problems. This talk will give folks the basic toolbox to analyze time-series data and perform forecasting using statistical and machine learning models, as well as interpret and convey the outputs.
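One of the classic statistical baselines in that toolbox, simple exponential smoothing, fits in a few lines (the series and smoothing factor below are invented for illustration):

```python
def ses_forecast(series, alpha=0.5):
    """Simple exponential smoothing: track a level that blends each new
    observation with the previous level, then forecast that level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level  # flat one-step-ahead forecast

forecast = ses_forecast([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

It is also a useful yardstick: if a heavier statistical or machine learning model cannot beat this baseline on held-out data, the extra complexity is probably one of the pitfalls the talk warns about.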
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward (PyData)
A historical text may now be unreadable, because its language is unknown, or its script forgotten (or both), or because it was deliberately enciphered. Deciphering needs two steps: Identify the language, then map the unknown script to a familiar one. I’ll present an algorithm to solve a cartoon version of this problem, where the language is known, and the cipher is alphabet rearrangement.
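The brute-force flavor of this approach can be sketched on the Caesar-shift subfamily of alphabet rearrangements: try every candidate key and keep the decryption that scores best against the known language. The full problem searches all 26! permutations and needs smarter scoring; the word-list scoring here is invented for illustration.

```python
import string

def shift(text, k):
    """Apply a Caesar shift, a simple special case of alphabet rearrangement."""
    out = []
    for c in text:
        if c in string.ascii_lowercase:
            out.append(chr((ord(c) - 97 + k) % 26 + 97))
        else:
            out.append(c)  # leave spaces and punctuation alone
    return "".join(out)

def crack(ciphertext, known_words):
    """Brute-force all 26 shifts; keep the output containing the most
    known-language words."""
    def score(text):
        return sum(w in known_words for w in text.split())
    return max((shift(ciphertext, k) for k in range(26)), key=score)

plain = crack(shift("meet me at dawn", 7), {"meet", "me", "at", "dawn"})
```

For a general substitution cipher the key space is far too large to enumerate, so practical solvers replace brute force with hill-climbing over permutations and replace the word list with n-gram frequency statistics of the known language.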
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... (PyData)
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it far more difficult for the thief to move or cash them out.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
1. Frontera: open source, large scale web
crawling framework
Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016
sibiryakov@scrapinghub.com
2. About myself
• Software Engineer @ Scrapinghub.
• Born in Yekaterinburg, RU.
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large-scale prediction of malicious download attempts.
3. We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.).
• Focused crawls & broad crawls.
{
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
      "time_ago": {
        "text": "2 hours ago",
        "href": "https://news.ycombinator.com/item?id=10352189"
      },
      "username": {
        "text": "hliyan",
        "href": "https://news.ycombinator.com/user?id=hliyan"
      }
    },
4. Broad crawl usages
• Lead generation (extracting contact information)
• News analysis
• Topical crawling
• Plagiarism detection
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Tracking criminal activity & finding lost persons (DARPA)
5. Saatchi Global Gallery Guide
www.globalgalleryguide.com
• Discover 11K online galleries.
• Extract general information, art samples, descriptions.
• NLP-based extraction.
• Find more galleries on the web.
6. Frontera recipes
• Multiple-website data collection automation
• "Grep" of an internet segment
• Topical crawling
• Extracting data from arbitrary documents
7. Multiple-website data collection automation
• Scrapers for multiple websites.
• Data items are collected and updated.
• Frontera can be used to:
• crawl in parallel and scale the process,
• schedule revisiting (within a fixed time),
• prioritize URLs during crawling.
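The revisit scheduling mentioned above can be sketched with a small priority queue keyed on next-fetch time. This is an illustrative toy, not Frontera's actual backend API; the class and method names are hypothetical:

```python
import heapq
import time

class RevisitScheduler:
    """Toy revisit scheduler: pop the most overdue URL first.
    Illustrative only -- not Frontera's actual backend API."""

    def __init__(self, revisit_interval_s):
        self.interval = revisit_interval_s
        self._heap = []  # (next_fetch_time, url)

    def add(self, url, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now, url))  # new URLs are due immediately

    def mark_fetched(self, url, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + self.interval, url))

    def next_batch(self, n, now=None):
        """Return up to n URLs whose revisit time has come."""
        now = time.time() if now is None else now
        batch = []
        while self._heap and len(batch) < n and self._heap[0][0] <= now:
            _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch
```

A real backend would persist this queue; keeping it in a heap is enough to show the scheduling policy.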
8. "Grep" of an internet segment
• An alternative to Google:
• collect the zone files from registrars (.com/.net/.org),
• set up Frontera in distributed mode,
• implement text processing in spider code,
• output items with matched pages.
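The "text processing in spider code" step can be as simple as a regular-expression match over the page body. A minimal sketch; the function name and item shape are hypothetical:

```python
import re

def grep_page(url, html, pattern):
    """Return a match item if the page body matches, else None.
    Illustrative stand-in for the text processing one would put in spider code."""
    text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
    m = re.search(pattern, text, re.IGNORECASE)
    if m:
        return {"url": url, "match": m.group(0)}
    return None
```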
9. Topical crawling
• a document topic classifier & a seed URL list,
• if a document is classified as positive, the crawler follows its extracted links,
• Frontera in distributed mode,
• topic classifier code lives in the spider.
Extensions: link classifier, follow/final classifiers.
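The positive-classification gate above can be sketched as follows, with a naive keyword count standing in for a real document topic classifier (function name and threshold are hypothetical):

```python
def follow_links(doc_text, extracted_links, keywords, threshold=2):
    """Follow a page's links only if the page itself looks on-topic.
    The keyword count is a toy stand-in for a trained topic classifier."""
    words = doc_text.lower().split()
    score = sum(words.count(k) for k in keywords)
    return extracted_links if score >= threshold else []
```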
10. Extracting data from arbitrary documents
• A tough problem; it can't be solved completely.
• Can be seen as a structured prediction problem:
• Conditional Random Fields (CRF) or
• Hidden Markov Models (HMM).
• A tagged sequence of tokens and HTML tags can be used to predict the data field boundaries.
• Tooling: the Webstruct library and the WebAnnotator Firefox extension.
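A minimal HMM decoder for tagging a token sequence with field labels can be sketched with the classic Viterbi algorithm. All probabilities below are hypothetical toy values, not a trained model:

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Minimal HMM Viterbi decoder for tagging tokens with field labels
    (e.g. O vs. PRICE).  Toy stand-in for the CRF/HMM models mentioned
    above; unseen emissions get a tiny smoothing probability."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(tokens[0], 1e-6))
          for s in states}]
    back = []
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + logp(trans_p[p][s]))
            back[-1][s] = best_prev
            V[t][s] = (V[t - 1][best_prev] + logp(trans_p[best_prev][s])
                       + logp(emit_p[s].get(tokens[t], 1e-6)))
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

In practice a CRF with rich token/HTML-tag features works better, but the decoding idea is the same.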
11. Task
• Spanish web: statistics on hosts and their sizes.
• Only the .es ccTLD.
• Breadth-first strategy:
• first the 1-click environment,
• then 2 clicks,
• then 3,
• …
• Finishing condition: at most 100 docs per host, across all hosts.
• Low costs.
12. Spanish, Russian, German and world Web in 2012

                          Domains   Web servers   Hosts    DMOZ*
Spanish (.es)             1.5M      280K          4.2M     122K
Russian (.ru, .рф, .su)   4.8M      2.6M          ?        105K
German (.de)              15.0M     3.7M          20.4M    466K
World                     233M      62M           890M     3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* - current period (October 2015)
13. Solution
• Scrapy (based on Twisted) - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• Twisted.Internet - async primitives for use in workers.
• Snappy - efficient compression algorithm for IO-bound applications.
15. 1. Big and small hosts problem
• The queue gets flooded with URLs from the same host,
• → underuse of spider resources.
• Solution: an additional per-host (per-IP) queue and a metering algorithm.
• URLs from big hosts are cached in memory.
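The per-host metering idea can be sketched as a round-robin over per-host queues, capping how many URLs a single host contributes per batch. Illustrative only, not Frontera's implementation:

```python
from collections import OrderedDict, deque

class PerHostQueue:
    """Toy per-host queue with round-robin metering: each batch takes at
    most `per_host_limit` URLs from any single host, so one big host
    cannot monopolize the spiders."""

    def __init__(self, per_host_limit=2):
        self.per_host_limit = per_host_limit
        self.queues = OrderedDict()  # host -> deque of URLs

    def push(self, host, url):
        self.queues.setdefault(host, deque()).append(url)

    def next_batch(self, n):
        batch = []
        for host in list(self.queues):
            q = self.queues[host]
            for _ in range(min(self.per_host_limit, len(q))):
                if len(batch) == n:
                    return batch
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
        return batch
```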
16. 2. DDoS of the Amazon AWS DNS service
• Breadth-first strategy → first visits to unknown hosts → a huge volume of DNS requests.
• Solution: a recursive DNS server on every spider node, with upstream to Verizon & OpenDNS.
• We used dnsmasq.
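A dnsmasq setup along these lines takes only a few lines of configuration; the upstream resolver IPs below are illustrative placeholders, not the ones used in the talk:

```
# /etc/dnsmasq.conf -- run on every spider node (illustrative)
cache-size=10000         # cache answers locally instead of hitting upstream
server=208.67.222.222    # OpenDNS upstream (example)
server=208.67.220.220    # OpenDNS upstream (example)
server=4.2.2.2           # Verizon/Level 3 upstream (example)
```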
17. 3. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses the OS DNS resolver,
• with blocking calls,
• in a thread pool that resolves DNS names to IPs.
• Result: numerous errors and timeouts 🆘
• Solution: a patch allowing thread pool size and timeout adjustment.
18. 4. Overloaded HBase region servers during state check
• ~10^3 links per doc,
• state check: CRAWLED/NOT CRAWLED/ERROR,
• stored on HDDs.
• Small volume 🆗
• As the table size ⬆, response times ⬆ and the disk queue ⬆
• Solution: a host-local fingerprint function for keys in HBase.
• Tuning the HBase block cache to fit the average host's states into one block.
3 Tb of metadata: URLs, timestamps, … (275 bytes/doc)
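The host-local key idea can be sketched like this: prefixing the key with crc32(host) keeps all URLs of one host adjacent in HBase's sorted key space, so a state check for one host touches few blocks. A sketch of the idea, not Frontera's exact key codec:

```python
import hashlib
from zlib import crc32

def hostname_local_fingerprint(url):
    """Key = 4-byte crc32(host) prefix + sha1(url).  URLs from the same
    host sort next to each other, making per-host reads block-local."""
    host = url.split("//", 1)[-1].split("/", 1)[0]
    prefix = crc32(host.encode()) & 0xFFFFFFFF
    return "%08x%s" % (prefix, hashlib.sha1(url.encode()).hexdigest())
```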
19. 5. Intensive network traffic from workers to services
• Throughput between workers and Kafka/HBase: ~1 Gbit/s.
• Solutions: Thrift compact protocol for HBase,
• message compression in Kafka with Snappy.
20. 6. Further query and traffic optimizations to HBase
• State checks generate many requests and heavy network traffic,
• and consistency is required.
• Solution: a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host.
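Partitioning the spider log by host can be sketched as a simple hash-mod routing function: every event for a host lands in the same partition, so a single strategy worker sees that host's full history and its cached state stays consistent. Illustrative sketch:

```python
from zlib import crc32

def partition_for(url, n_partitions=3):
    """Route every URL of a given host to the same spider-log partition."""
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return crc32(host.encode()) % n_partitions
```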
21. State cache
• All ops are batched:
– if a key is not in the cache → read from HBase,
– every ~4K docs → flush.
• Close to 3M (~1 Gb) elements → flush & cleanup.
• Least-Recently-Used (LRU) eviction 👍
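The state cache above can be sketched with an ordered dict acting as an LRU map with a size cap. Toy sizes and a hypothetical API, purely illustrative:

```python
from collections import OrderedDict

class StateCache:
    """Toy LRU state cache for the strategy worker: misses fall through to
    the store (a batched HBase read in the real system), and the oldest
    entries are evicted once the cache grows past max_size."""

    def __init__(self, max_size=3_000_000):
        self.max_size = max_size
        self._cache = OrderedDict()  # fingerprint -> state

    def get(self, fingerprint, read_from_store):
        if fingerprint in self._cache:
            self._cache.move_to_end(fingerprint)   # mark recently used
            return self._cache[fingerprint]
        state = read_from_store(fingerprint)       # cache miss
        self.put(fingerprint, state)
        return state

    def put(self, fingerprint, state):
        self._cache[fingerprint] = state
        self._cache.move_to_end(fingerprint)
        while len(self._cache) > self.max_size:    # flush & cleanup: evict LRU
            self._cache.popitem(last=False)
```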
22. Spider priority queue (slot)
• Cell: an array of (fingerprint, crc32(hostname), URL, score).
• Dequeueing the top N cells.
• Prone to flooding by huge hosts.
• Scoring model: document count per host.
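Dequeueing the top N cells by score can be sketched with a heap; the cell layout mirrors the slide, and the function is illustrative:

```python
import heapq
from zlib import crc32

def dequeue_top_n(cells, n):
    """Pop the n highest-scored queue cells.  Each cell mimics the slide:
    (fingerprint, crc32(hostname), url, score)."""
    return heapq.nlargest(n, cells, key=lambda cell: cell[3])
```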
23. 7. Problem of big and small hosts (strikes back!)
• Discovered a few very large hosts (>20M docs),
• all queue partitions were flooded with URLs from them.
• Solution: two MapReduce jobs:
– queue shuffling,
– limiting every host to 100 docs max.
24. Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• in 1.5 months,
• 22 websites with more than 50M pages.
29. Hardware requirements (distributed backend + spiders)
• A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel.
• Spider-to-worker ratio is 4:1 (without content).
• 1 Gb of RAM for every strategy worker (state cache, tunable).
• Example: 12 spiders ~ 14.4K pages/min., 3 strategy workers and 3 DB workers, 18 cores total.
30. Software requirements

               Single process              Distributed spiders         Distributed backend
Runtime        Python 2.7+, Scrapy 1.0.4+ (all run modes)
Storage        sqlite or any other RDBMS   sqlite or any other RDBMS   HBase/RDBMS
Message bus    -                           ZeroMQ or Kafka             ZeroMQ or Kafka
DNS            -                           -                           DNS service
31. Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase are included).
• Run modes: single process, distributed spiders, distributed backend.
• Scrapy ecosystem: good documentation, big community, ease of customization.
32. Main features (continued)
• Message bus abstraction (ZeroMQ and Kafka are available out of the box).
• Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Canonical URL resolution abstraction: each document has many URLs; which one to use?
• Python everywhere: workers and spiders.
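Canonical URL resolution can be sketched with a simple heuristic, e.g. prefer the shortest URL without a query string. A hypothetical policy for illustration, not Frontera's actual resolver:

```python
def canonical_url(urls):
    """Toy canonical-URL resolver: among all URLs seen for one document,
    prefer those without a query string, then the shortest."""
    def rank(u):
        return ("?" in u, len(u))
    return min(urls, key=rank)
```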