Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut... (Data Con LA)
R is the most popular language in the data-science community, with 2+ million users and 6,000+ R packages. R’s adoption has grown along with its easy-to-use statistical language, graphics, packages, tools, and active community. In this session we introduce Distributed R, a new open-source technology that addresses the scalability and performance limitations of vanilla R, which is single-threaded and does not scale to large datasets. Distributed R efficiently shares sparse structured data, leverages multiple cores, and dynamically partitions data to mitigate load imbalance.
In this talk, we show the promise of this approach by demonstrating how important machine learning and graph algorithms can be expressed in a single framework and run substantially faster under Distributed R. Additionally, we show how Distributed R complements Vertica, a state-of-the-art columnar analytics database, to deliver a full-cycle, fully integrated data “prep-analyze-deploy” solution.
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook (BigDataCloud)
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data stores share a common set of requirements: they have to be highly scalable, maintenance costs must be low, and they have to perform efficiently. We use a sharded MySQL+memcache solution to support real-time access to tens of petabytes of data, and we use TAO to provide consistency for this web-scale database across geographical distances. We use the Haystack datastore to store the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.
This talk describes why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability, and partition tolerance of each of these solutions, and upon the reasons why some of these systems need ACID semantics and others do not. We also briefly discuss our future plans for big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD transactional databases.
Drill into Drill – How Providing Flexibility and Performance is Possible (MapR Technologies)
Learn how Drill achieves high performance with flexibility and ease of use. Topics include: first-read planning and statistics; flexible code generation depending on workload; code optimization and planning techniques; dynamic schema subsets; advanced memory use and moving between Java and C; and making static typing appear dynamic through any-time and multi-phase planning.
A summary of recent progress on Apache Drill, an open-source, community-driven project to provide easy, dependable, fast, and flexible ad hoc query capabilities.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc... (Cloudera, Inc.)
This talk covers which tools and techniques work well (and which don't) for data scientists working on Hadoop today, how to leverage the lessons learned by the experts to increase your productivity, and what to expect for the future of data science on Hadoop. We draw on insights from the top data scientists working on big data systems at Cloudera, as well as experience running big data systems at Facebook, Google, and Yahoo.
Building a Business on Hadoop, HBase, and Open Source Distributed Computing (Bradford Stephens)
This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.
Overview of Big data, Hadoop and Microsoft BI - version1 (Thanh Nguyen)
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
The Yahoo! Hadoop grid makes use of a managed service to pull data into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts away file locations and the underlying storage format of data, and brings several other advantages, such as sharing of data among MapReduce, Pig, and Hive. In this talk, we focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
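As a rough illustration of the data-out path this abstract describes, a client can query HiveServer2 and let the metastore/HCatalog resolve tables to files and partitions. The sketch below uses the PyHive client with hypothetical host, credentials, and table names; it is not code from the talk.

from pyhive import hive

# Hypothetical connection details; a real grid would have its own host,
# port, and authentication setup.
conn = hive.connect(host='hiveserver2.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# The client only names the logical table; HCatalog/the metastore resolves
# the files, partitions, and storage format behind it.
cursor.execute(
    "SELECT page, COUNT(*) AS views "
    "FROM web_events WHERE dt = '2012-06-14' "
    "GROUP BY page LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()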
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The term "Data Lake" has become almost as overused and undescriptive as "Big Data". Many believe that centralizing datasets in HDFS makes a data lake, but then they struggle to realize any tangible value. This talk will redefine the "Data Lake" by describing four specific, key characteristics that we at Koverse have learned are crucial to successful enterprise data lake deployments. These characteristics are 1) indexing and search across all data sets, 2) interactive access for all users in the enterprise, 3) multi-level access control, and 4) integration with data science tools. These characteristics define a system that lets people realize value from their data versus getting lost in the hype. The talk will go on to provide a technical description of how we have integrated several projects, namely Apache Accumulo, Hadoop, and Spark, to implement an enterprise data lake with these key features.
Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant logo, which is named after Doug's son's toy elephant!
The topics covered in the presentation are:
1. Big Data Learning Path
2. Big Data Introduction
3. Hadoop and its Eco-system
4. Hadoop Architecture
5. Next Steps on how to set up Hadoop
Introduction to Vertica (Architecture & More) (LivePerson)
LivePersonDev is happy to host this meetup with Zvika Gutkin, an expert Oracle and Vertica DBA at LivePerson and a specialist in BI and Big Data.
At LivePerson, we handle enormous amounts of data. We use Vertica to analyse this data in real time.
In this lecture Zvika will cover the following:
1. Present the architecture of Vertica
2. Compare row stores to column stores
3. Explain how Vertica achieves fast query times
4. Show a few use cases
5. Explain what LivePerson does with Vertica and why we chose it
6. Talk about why we love Vertica and why we hate it
7. Is Vertica a SQL DB or NoSQL? Is Vertica consistent or eventually consistent?
8. How does Vertica differ from other SQL and NoSQL technologies?
Hadoop YARN is the next-generation computing platform in Apache Hadoop, with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all problems using the MapReduce programming model alone. Typical installations run separate programming models like MR, MPI, and graph-processing frameworks on individual clusters. Running fewer, larger clusters is cheaper than running more small clusters, so leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes important from an economical and operational point of view. This talk covers the different APIs and RPC protocols available for developers to implement new application frameworks on top of YARN. We also go through a simple application which demonstrates how one can implement their own ApplicationMaster, schedule requests to the YARN ResourceManager, and then use the allocated resources to run user code on the NodeManagers.
In this age of Big Data, data volumes grow exceedingly larger while the technical problems and business scenarios become more complex. Compounding these complexities, data consumers are demanding faster analysis to common business questions asked of their Big Data. This session provides concrete examples of how to address this challenge. We will highlight the use of Big Data technologies—including Hadoop and Hive—with classic BI systems such as SQL Server Analysis Services.
Session takeaways:
• Understand the architectural components surrounding Hadoop, Hive, Classic BI, and the Tier-1 BI ecosystem
• Get strategies for addressing the technical issues when working with extremely large cubes
• See how to address the technical issues when working with Big Data systems from the DBA perspective
How Klout is changing the landscape of social media with Hadoop and BI (Denny Lee)
Updated from the Hadoop Summit slides (http://www.slideshare.net/Hadoop_Summit/klout-changing-landscape-of-social-media), we've included additional screenshots to help tell the whole story.
SnapLogic provides a Data Integration platform that takes integration to another level by combining the power of dynamic programming languages with standard Web interfaces to solve today's most pressing problems in application integration. SnapLogic has an intuitive visual designer that runs in your browser and connects to a highly scalable, web-based integration server that you can run on premise or in the cloud.
RightScale Webinar: How RightScale Architects Its Databases (for Worldwide Sc... (RightScale)
Is your database holding back your application? Find out how we at RightScale use SQL and NoSQL databases such as MySQL and Cassandra to provide a scalable, distributed, and highly available service around the world that is designed to recover from failures of a whole cloud region.
In this webinar, we will:
- Share the data taxonomy for specific RightScale systems
- Give you insights on how to think through your own data taxonomy
- Go deep into RightScale's distributed database architecture
Join RightScale's VP of Engineering and Chief Architect and learn directly from the team who architected RightScale's databases for scale, HA and DR.
In this slidecast, Jim Kaskade from Infochimps presents: Cloud for Big Data.
"Infochimps was founded by data scientists and cloud computing experts. Our solutions make it faster, easier and far less complex to build and manage Big Data systems behind applications to quickly deliver actionable insights. With Infochimps Cloud, enterprises benefit from the fastest way to deploy Big Data applications in complex, hybrid cloud environments."
Learn more at:
http://infochimps.com
View the presentation video:
http://inside-bigdata.com/slidecast-cloud-for-big-data/
Hydrologic Information Systems and the CUAHSI HIS Desktop Application (ACSG Section Montréal)
The U.S. National Science Foundation supported Consortium of Universities for the Advancement of Hydrologic Sciences (CUAHSI) Hydrologic Information System (HIS) project includes extensive development of data storage and delivery tools and standards, including WaterML (a language for sharing hydrologic data sets via web services) and HIS Server (a software tool set for delivering WaterML from a server). These and other CUAHSI HIS tools have been under development and deployment for several years and, together, present a relatively complete software “stack” to support the consistent storage and delivery of hydrologic and other environmental observation data. This presentation describes the development of a new HIS software tool called “HIS Desktop” and the development of an online open source software development community to update and maintain the software. HIS Desktop is envisioned as a local (i.e. not server-based) client-side software tool that will run on multiple operating systems and will provide a highly usable level of access to HIS services. The software will provide many key capabilities including data query, map-based visualization, data download, local data maintenance, editing, graphing, data export to selected model-specific data formats, linkage with integrated modeling systems such as OpenMI, and potentially upload to the HIS server from the local desktop software. As the software is presently in the early stages of development, this presentation will focus on design approach and paradigm, and is viewed as an opportunity to encourage participation in the open development community. Indeed, recognizing the value of community-based code development as a means of ensuring end-user adoption, this project has adopted an “iterative” or “spiral” software development approach which will be described in this presentation.
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
1. Hadoop and Vertica
The Data Analytics Platform at Twitter
Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Hadoop Summit, June 2012
4. We count things
• 140 characters
• 140M active users
• 400M tweets per day
• 80-100 TB ingested daily (uncompressed)
• Tens of thousands of Hadoop jobs daily
6–12. Data flow: Analytics
[Architecture diagram, built up across slides 6–12. Production hosts send log events and application data through Scribe aggregators into a staging Hadoop cluster, and a Log Mover copies them into the main Hadoop DW; a distributed crawler brings third-party imports into HDFS. Crane imports the social graph, tweets, and user profiles from MySQL/Gizzard and MySQL. Oink runs Pig jobs over the main Hadoop DW (which also hosts HBase and, on the final slide, HCatalog), Rasvelg builds aggregates, and Crane and Oink load results into Vertica and MySQL, which back the analytics web tools used by analysts, engineers, PMs, and sales.]
14. System concepts
• Loose coupling
• Job coordination as a service
• Resource management as a service
• Idempotence
15. Loose coupling
• Multiple job frameworks
• Right tool for the job
• Common dependency management
16. Job coordination
• Shared batch table for job state
  batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
• Access via client libraries
• Jobs & data are time-based
• 3 types of preconditions
  1. other job success (i.e., predecessor job complete)
  2. existence of data (i.e., HDFS input exists)
  3. user-defined (i.e., MySQL slave lag)
• Failed jobs get retried (usually)
[Slides 16–20 repeat these bullets while building up a small diagram: a job's preconditions point at predecessor jobs, at data in HDFS, and at user-defined checks.]
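To make the batch-table idea concrete, here is a minimal sketch assuming the field names from the slide and hypothetical helper commands; it is not Twitter's client library. It shows the table plus one check for each of the three precondition types.

import sqlite3
import subprocess

conn = sqlite3.connect("batch.db")
conn.execute("""CREATE TABLE IF NOT EXISTS batch (
    id INTEGER PRIMARY KEY,
    description TEXT,        -- which job this entry belongs to
    state TEXT,              -- e.g. RUNNING, SUCCESS, FAILED
    start_time TEXT,         -- start of the data window the run covers
    end_time TEXT,           -- end of the data window
    job_start_time TEXT,     -- when the run actually started
    job_end_time TEXT        -- when the run finished
)""")

def predecessor_succeeded(description, start_time, end_time):
    # Precondition 1: another job covering the same time window succeeded.
    row = conn.execute(
        "SELECT 1 FROM batch WHERE description = ? AND state = 'SUCCESS' "
        "AND start_time = ? AND end_time = ?",
        (description, start_time, end_time)).fetchone()
    return row is not None

def hdfs_input_exists(path):
    # Precondition 2: the HDFS input for the window exists.
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0

def mysql_slave_caught_up(current_lag_seconds, max_lag_seconds=300):
    # Precondition 3: a user-defined check, e.g. MySQL slave lag is acceptable.
    return current_lag_seconds <= max_lag_seconds

A job runs only when all of its preconditions return true; a failed run is recorded in the table and, usually, retried.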
21. Resource management
• Analytics Resource Manager - ARM!
• Library above Zookeeper
• Throttles jobs and workers
  • Only 1 job of this name may run at once
  • Only N jobs may be run by this app at once
  • Only M mappers may write to Vertica at once
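The slides do not show ARM's internals, but the three throttles can be sketched with ZooKeeper lock and semaphore recipes. The example below uses the kazoo Python client with made-up paths, hosts, and limits; it is only an illustration of the idea.

from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock, Semaphore

def run_load_job():
    # Placeholder for the actual work (e.g. a Pig load into Vertica).
    pass

zk = KazooClient(hosts="zookeeper.example.com:2181")   # hypothetical ensemble
zk.start()

# "Only 1 job of this name may run at once": an exclusive lock per job name.
job_lock = Lock(zk, "/arm/jobs/user_sessions_loader")

# "Only M mappers may write to Vertica at once": a semaphore with M leases.
vertica_writers = Semaphore(zk, "/arm/resources/vertica_writers", max_leases=8)

with job_lock:               # blocks until no other instance of this job holds the lock
    with vertica_writers:    # blocks until a Vertica writer slot is free
        run_load_job()

zk.stop()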
24. Job DAG & state transition
“Local View”
• Is it time for me to run yet?
• Are my dependencies satisfied?
• Any resource constraints?
[State diagram, built up across slides 24–26: if the resource request is denied, the job stays Idle; once granted, it inserts an entry into the batch table (id, description, state, start_time, end_time, job_start_time, job_end_time), moves to Execution, and a completion check either finishes the run or sends it back to Execution.]
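As a rough Python sketch of that local view (with assumed job, batch-table, and resource-manager objects, not the real scheduler), each job loops through the same three questions before recording itself in the batch table and executing:

import time

def run_when_ready(job, batch_table, arm):
    while True:
        if not job.window_has_arrived():                      # is it time for me to run yet?
            time.sleep(60)
            continue
        if not all(check() for check in job.preconditions):   # are my dependencies satisfied?
            time.sleep(60)
            continue
        if not arm.try_acquire(job.resources):                # any resource constraints?
            time.sleep(60)                                    # denied: stay Idle and retry
            continue
        try:
            entry = batch_table.insert(job, state="RUNNING")  # granted: record the run
            job.execute()
            if job.completed_successfully():
                batch_table.update(entry, state="SUCCESS")
                return
            batch_table.update(entry, state="FAILED")         # failed runs usually get retried
        finally:
            arm.release(job.resources)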
27. Example: active users
[Job DAG, built up across slides 27–32:
• Scribe delivers web_events and sms_events from the production hosts to the main Hadoop DW via the log mover (through the staging cluster).
• Oink/Pig cleanses, filters, transforms, geo-lookups, unions, and de-duplicates the events into user_sessions.
• Crane imports user_profiles from MySQL/Gizzard.
• Rasvelg joins, groups, and counts to produce aggregations such as active_by_geo, active_by_device, and active_by_client in Vertica and MySQL, which feed the analytics dashboards.]
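The slides only name the aggregations, but a Rasvelg-style "join, group, count" rollup such as active_by_geo might look like the sketch below. Table and column names are hypothetical, and the vertica-python client stands in for whatever Rasvelg actually uses.

import vertica_python

conn_info = {"host": "vertica.example.com", "port": 5433,
             "user": "rasvelg", "password": "***", "database": "analytics"}

conn = vertica_python.connect(**conn_info)
cur = conn.cursor()
# Join sessions with profiles, then count distinct active users per country for one day.
cur.execute("""
    INSERT INTO active_by_geo (day, country, active_users)
    SELECT s.day, p.country, COUNT(DISTINCT s.user_id)
    FROM user_sessions s
    JOIN user_profiles p ON p.user_id = s.user_id
    WHERE s.day = '2012-06-14'
    GROUP BY s.day, p.country
""")
conn.commit()
conn.close()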
34. Vertica or Hadoop?
• Vertica
• Loads 100s of Ks rows/second
• Aggregate 100s of Ms rows in seconds
• Used for low latency queries and aggregations
• Keep a sliding window of data
• Hadoop
• Excels when data size is massive
• Flexible and powerful
• Great with nested data structures and unstructured data
• Used for complex functions and ML
35. Vertica import options
• Direct import via Crane
  • Load into dest table, single thread
• Atomic import via Crane/Rasvelg
  • Crane loads to temp table, single thread
  • Rasvelg moves to dest table
• Parallel import via Oink/Pig
  • Pig job via VerticaStorer
• ARM throttles active DB connections
[Diagram: Crane, Rasvelg, and Oink move data among MySQL/Gizzard, the main Hadoop DW, Vertica, and MySQL.]
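The "atomic import" bullet is the usual staging-table pattern: load into a temp table, then move the rows into the destination with a single statement so readers never see a partial load. A minimal sketch, again with the vertica-python client and made-up table names rather than the actual Crane/Rasvelg code:

import vertica_python

conn = vertica_python.connect(host="vertica.example.com", port=5433,
                              user="crane", password="***", database="analytics")
cur = conn.cursor()

# Crane's role: single-threaded load into a staging table.
cur.execute("DROP TABLE IF EXISTS user_profiles_tmp")
cur.execute("CREATE TABLE user_profiles_tmp LIKE user_profiles")
with open("/exports/user_profiles.csv") as f:        # hypothetical export file
    cur.copy("COPY user_profiles_tmp FROM STDIN DELIMITER ','", f)

# Rasvelg's role: move the rows into the destination table. The move is one
# INSERT ... SELECT, so the destination is never partially visible.
cur.execute("INSERT INTO user_profiles SELECT * FROM user_profiles_tmp")
cur.execute("DROP TABLE user_profiles_tmp")
conn.commit()
conn.close()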
36. Vertica imports - pros/cons
• Crane & Rasvelg
  • Good for smaller datasets, DB to DB transfers
  • Single threaded
  • Easy on Vertica
  • Hadoop not required
• Pig
  • Great for larger datasets
  • More complex, not atomic
  • DDOS potential
[Diagram: the same import paths as the previous slide, from MySQL/Gizzard and the main Hadoop DW into Vertica via Crane, Rasvelg, and Oink.]
37. VerticaStorer
• PigStorage implementation
• From Vertica’s Hadoop connector suite
• Out of the box
• Easy to get Hello World working
• Well documented
• Pig/Vertica data bindings work well
• Fast!
• Transaction-aware tasks
• No bugs found
• Open source?
38–39. Pig VerticaStorage
• Our enhancements
  • Connection credential management
  • Truncate before load option
  • Throttle concurrent writers via ZK
• Future features
  • Counters for rows inserted/rejected
  • Name-based tuple-column bindings
  • Atomic load via temp table

SET mapred.map.tasks.speculative.execution false;
user_sessions = LOAD '/processed/user_sessions/2012/06/14';
STORE user_sessions INTO '{db_schema.user_sessions}' USING
  com.twitter.twadoop.pig.store.VerticaStorage(
    'config/db.yml', 'db_name', 'arm_resource_name');
40. Gotcha #1
• MR data load is not atomic
• Avoid partial reads
• Option 1: load to temp table, then insert direct
• Option 2: add job dependency concept
41. Gotcha #2
• Speculative execution is not always your friend
• Launch more tasks than needed, just in case
• For non-idempotent jobs, extra tasks == BAD
42. Gotcha #3
• isIdempotent() must be a first-class concept
• Loader jobs will fail
• Failure after first task success == not good
• Can’t automate retry without cleanup
43. Gotcha #4
• Vendor code only gets you so far
• Nice to haves == have to write
• Favor the decorator pattern
• Pig’s StoreFuncWrapper can help
• Vendor open sourcing is ideal
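To illustrate the "favor the decorator pattern" advice with a generic example (plain Python, not Pig's StoreFuncWrapper or Twitter's code), extra behavior such as truncate-before-load can be layered around a vendor storer without modifying it:

class VendorStorer:
    """Stand-in for a vendor-supplied storage class we cannot modify."""
    def store(self, rows, table):
        print("storing %d rows into %s" % (len(rows), table))

class TruncateBeforeLoadStorer:
    """Decorator: wraps any storer with a store() method and truncates first."""
    def __init__(self, inner, truncate):
        self.inner = inner
        self.truncate = truncate          # callable that empties the destination table
    def store(self, rows, table):
        self.truncate(table)              # added behavior
        self.inner.store(rows, table)     # delegate to the wrapped storer

storer = TruncateBeforeLoadStorer(
    VendorStorer(),
    truncate=lambda table: print("truncating %s" % table))
storer.store([{"user_id": 1}], "user_sessions")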
44. Future work
• More VerticaStorer features
• Multiple Vertica clusters
• Atomic DB loads with Pig/Oink
• Better DAG visibility
• Better job history visibility
• MR job optimizations via historic stats
• HCatalog data registry
• Job push events