Integrating Hadoop in your existing data warehouse and business intelligence environment. Speakers: Jeff Hammerbacher, Cloudera, and Anil Madan, eBay.
Recording of the webinar: https://www1.gotomeeting.com/register/515000760
How To Leverage OBIEE Within A Big Data Architecture – Kevin McGinley
If you've invested in OBIEE and want to start exploring the use of Big Data technology, this presentation talks about how and why you might want to use OBIEE as the common visualization layer across both.
Reports and DITA Metrics IXIASOFT User Conference 2016 – IXIASOFT
Reports and DITA Metrics, an IXIASOFT presentation at the IXIASOFT User Conference 2016, by Keith Schengili-Roberts, DITA Information Architect, IXIASOFT; Nathalie Laroche, Lead Technical Writer, IXIASOFT; and Dustin Clark, Lead DITA Architect, Intel
See some common myths, discover the various open source enterprise search packages available and see some case studies on how open source software has helped organisations build effective search.
Search and Recommendations: 3 Sides of the Same Coin – Nick Pentreath
Recommendation engines are one of the best-known, most widely used and highest-value use cases for applied machine learning. Search and recommender systems are closely linked, often co-existing and intermingling. Indeed, modern search applications at scale typically involve significant elements of machine learning, while personalization systems rely heavily on and are deeply integrated with search engines. In this session, I will explore this link between search and recommendations.
In particular, I will cover three of the most common approaches for using search engines to serve personalized recommendation models. I call these the "score then search", "native search" and "custom ranking" approaches. I will detail each approach, comparing it with the others in terms of various considerations important for production systems at scale, including the architecture, schemas, performance, quality and flexibility aspects. Finally, I will also contrast these model-based approaches with what is achievable using pure search.
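As a rough illustration of the general idea behind serving recommendations through a search engine, the sketch below re-ranks search-engine candidates with a precomputed per-user model score. The toy index, user, and scores are invented for the example; the talk's actual approaches differ in where the scoring happens.

```python
# Hypothetical sketch: retrieve candidates from a (toy) search index,
# then re-rank them by a precomputed personalization score, e.g. one
# produced offline by a recommender model. All data here is made up.

def search(index, query):
    """Return ids of items whose text contains the query term."""
    return [item_id for item_id, text in index.items() if query in text]

def rerank(candidates, user_scores):
    """Order search candidates by the user's model score, highest first."""
    return sorted(candidates, key=lambda i: user_scores.get(i, 0.0), reverse=True)

index = {
    "a": "red running shoes",
    "b": "blue running jacket",
    "c": "red casual shoes",
}
# Precomputed model scores for one user.
user_scores = {"a": 0.2, "b": 0.9, "c": 0.5}

candidates = search(index, "running")   # ['a', 'b']
print(rerank(candidates, user_scores))  # ['b', 'a']
```

In a production system the candidate retrieval would be a real search engine query and the scores would come from a trained model, but the retrieve-then-rank shape is the same.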
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
Applied Semantic Search with Microsoft SQL Server – Mark Tabladillo
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, continuously experimenting to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated collection of assets that require different runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available to real-time applications without significant latency? Different techniques are needed for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require built-in auditing, authorization and approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can deliver insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud, optimized for the Hortonworks Data Platform. DSX essentially brings typical software engineering practices to data science, organizing the dev -> test -> production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracy and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at:
Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, Deutsche Bahn.
Mr. Baltagi also has over 14 years of IT experience with an emphasis on full life cycle development of enterprise web applications using Java and open-source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail on the problem, a build/buy/adopt analysis and Lyft's solution, Amundsen, along with thoughts on the future.
Amundsen: From discovering data to securing data – markgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Family tree of data – provenance and neo4j – M. David Allen
Discusses data provenance and how it can be implemented in neo4j, as well as many lessons learned about the relative strengths and weaknesses of relational and graph databases.
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012 – Jonathan Seidman
A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera – Cloudera, Inc.
Performance is a thing that you can never have too much of. But performance is a nebulous concept in Hadoop. Unlike databases, there is no equivalent in Hadoop to TPC, and different use cases experience performance differently. This talk will discuss advances on how Hadoop performance is measured and will also talk about recent and future advances in performance in different areas of the Hadoop stack.
Hadoop World 2011: Preview of the New Cloudera Management Suite - Phil Zeylig... – Cloudera, Inc.
This session will preview what is new in the latest release of the Cloudera Management Suite. We will cover the common problems we've seen in Hadoop management and will demonstrate several new features designed to address them.
Cloudera Impala: A Modern SQL Engine for Hadoop – Cloudera, Inc.
This is a technical deep dive into Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-source codebase that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, begins with an overview of Impala from the user's perspective, followed by an overview of Impala's architecture and implementation, and will conclude with a comparison of Impala with Dremel and Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.
Open data is a crucial prerequisite for inventing and disseminating the innovative practices needed for agricultural development. To be usable, data must not just be open in principle—i.e., covered by licenses that allow re-use. Data must also be published in a technical form that allows it to be integrated into a wide range of applications. The webinar will be of interest to any institution seeking ways to publish and curate data in the Linked Data cloud.
This webinar describes the technical solutions adopted by a widely diverse global network of agricultural research institutes for publishing research results. The talk focuses on AGRIS, a central and widely-used resource linking agricultural datasets for easy consumption, and AgriDrupal, an adaptation of the popular, open-source content management system Drupal optimized for producing and consuming linked datasets.
Agricultural research institutes in developing countries share many of the constraints faced by libraries and other documentation centers, and not just in developing countries: institutions are expected to expose their information on the Web in a re-usable form with shoestring budgets and with technical staff working in local languages and continually lured by higher-paying work in the private sector. Technical solutions must be easy to adopt and freely available.
Apache CarbonData+Spark to realize data convergence and Unified high performa... – Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
Data Lakehouse, Data Mesh, and Data Fabric (r1) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
The millions of people that use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen their business. Most of the talk will be spent on our data processing architecture, and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Lastly, I'll present observations and thoughts on innovation in the data processing (aka Big Data) field.
14 Tips for Planning ECM Content Migration to SharePoint – Joel Oleson
• Is your organization using any Enterprise Content Management systems besides SharePoint?
• Has your current ECM system been deprecated, or does it require an expensive annual maintenance contract?
• Does your firm already use Microsoft SharePoint as an intranet/collaboration portal?
• Would you like to leverage the cutting edge ECM and taxonomy features in SharePoint 2010 or 2013?
If so, it may be time to migrate your scanned and other transactional documents from your legacy ECM system to SharePoint, and take advantage of innovative ECM and taxonomy features available in this powerful platform.
Top industry experts and influencers Joel Oleson and Tom Castiglia from Hershey Technologies will explain best practices to plan and implement a successful ECM content migration to Microsoft SharePoint. Join us on June 4th to learn about:
• Using ECM and taxonomy related features in SharePoint 2010 and 2013
• Reasons to migrate content to SharePoint (as well as reasons not to migrate!)
• Best practices for ECM architecture in SharePoint including security, taxonomy and governance
• 14 specific tips for planning your migration from any ECM system to SharePoint
Testing Big Data: Automated Testing of Hadoop with QuerySurge – RTTS
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
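The core of automated data testing is source-vs-target reconciliation. The following is a minimal, tool-agnostic sketch of that kind of check (not QuerySurge's actual API; table contents are invented): compare row counts and row-level content between a source table and its Hadoop-loaded copy.

```python
# Sketch of a source-vs-target data reconciliation check, the kind of
# comparison a data testing tool automates at scale. Data is made up.

def compare_tables(source_rows, target_rows):
    """Return a summary of count and content differences between tables."""
    missing = [r for r in source_rows if r not in target_rows]
    extra = [r for r in target_rows if r not in source_rows]
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": missing,
        "unexpected_in_target": extra,
        "match": not missing and not extra,
    }

# Pretend these were fetched from the source database and the Hadoop copy.
source = [(1, "alice"), (2, "bob"), (3, "carol")]
target = [(1, "alice"), (2, "bob")]

result = compare_tables(source, target)
print(result["match"])              # False
print(result["missing_in_target"])  # [(3, 'carol')]
```

Real tools run such comparisons via SQL/HiveQL against both stores and scale them with sampling or checksums, but the pass/fail logic reduces to this shape.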
The Great Lakes: How to Approach a Big Data Implementation – Inside Analysis
The Briefing Room with Dr. Robin Bloor and Think Big, a Teradata Company
Live Webcast April 7, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=4114b87441ab7b2b4c52f6b24776e5a1
The more things change in Big Data, the more they stay the same. Indeed, there are many similarities between a Hadoop-based Data Lake and today’s modern Data Warehouse. Regardless of platform, information workers must still be able to turn their assets into action quickly, without taking a hit on governance or downstream performance.
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains the challenges facing organizations that embark on Big Data projects. He’ll be briefed by Rick Stellwagen of Think Big, a Teradata Company, who will outline his company’s approach to handling Big Data implementations. Rick will discuss the role of the data lake, and how timely response of queries is critical for reporting and analysis.
Visit InsideAnalysis.com for more information.
Streaming ETL for Data Lakes using Amazon Kinesis Firehose - May 2017 AWS Onl... – Amazon Web Services
Learning Objectives:
- Understand key requirements for collecting, preparing, and loading streaming data into data lakes
- Get an overview of transmitting data using Amazon Kinesis Firehose
- Learn how to perform data transformations with Amazon Kinesis Firehose
Data lakes enable your employees across the organization to access and analyze massive amounts of unstructured and structured data from disparate data sources, many of which generate data continuously and rapidly. Making this data available in a timely fashion for analysis requires a streaming solution that can durably and cost-effectively ingest this data into your data lake. Amazon Kinesis Firehose is a fully managed service that makes it easy to prepare and load streaming data into AWS. In this tech talk, we will provide an overview of Amazon Kinesis Firehose and dive deep into how you can use the service to collect, transform, batch, compress, and load real-time streaming data into your Amazon S3 data lakes.
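The inline transformation step Firehose supports is typically an AWS Lambda function with a specific record contract. Below is a hedged sketch of that handler shape; the payload field `event_type` is an invented example, and in a real deployment the function would be registered on the delivery stream rather than called directly.

```python
import base64
import json

# Sketch of the record-in/record-out contract a Firehose data
# transformation Lambda follows: decode each record's base64 payload,
# transform it (here: uppercasing a made-up "event_type" field), and
# return it with result "Ok" so Firehose delivers the transformed data.

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["event_type"] = payload["event_type"].upper()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

# Simulated invocation with a single record, the way Firehose would call it.
raw = base64.b64encode(json.dumps({"event_type": "click"}).encode()).decode()
resp = handler({"records": [{"recordId": "1", "data": raw}]}, None)
print(json.loads(base64.b64decode(resp["records"][0]["data"])))  # {'event_type': 'CLICK'}
```

Records that fail to transform can instead be returned with a failure result so Firehose can route them to an error destination rather than dropping them.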
apidays LIVE Paris 2021 - Building an analytics API by David Wobrock, Botify – apidays
apidays LIVE Paris 2021 - APIs and the Future of Software
December 7, 8 & 9, 2021
Building an analytics API
David Wobrock, Senior Lead API Engineer at Botify
The DSpace infrastructure for logging page-views and downloads has been limited to aggregations on communities, collections and items. While this already provides a wealth of aggregated information that is impossible to retrieve using Google Analytics, it still does not assist a repository manager in addressing questions such as:
“How many downloads did Professor X get through Google Scholar last month?”
Because authors are represented as metadata on items, tackling this challenge effectively means opening the potential to aggregate pageview and download statistics on any metadata field in the repository.
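To illustrate the idea (this is a standalone sketch, not DSpace's actual implementation): once each download event is joined to its item's metadata, usage can be aggregated on any metadata field, such as downloads per author. The event shape here is invented; `dc.contributor.author` is a standard Dublin Core field name.

```python
from collections import defaultdict

# Aggregate download events on an arbitrary item metadata field.
# Items and events below are made up for the example.

def downloads_by_field(events, item_metadata, field):
    """Sum download counts per value of the given metadata field."""
    totals = defaultdict(int)
    for event in events:
        # An item may carry several values for a field (e.g. co-authors);
        # the event counts toward each of them.
        for value in item_metadata[event["item"]].get(field, []):
            totals[value] += event["count"]
    return dict(totals)

item_metadata = {
    "item-1": {"dc.contributor.author": ["Professor X"]},
    "item-2": {"dc.contributor.author": ["Professor X", "Professor Y"]},
}
events = [
    {"item": "item-1", "count": 3},
    {"item": "item-2", "count": 2},
]

print(downloads_by_field(events, item_metadata, "dc.contributor.author"))
# {'Professor X': 5, 'Professor Y': 2}
```

The same function answers the "downloads per author" question for any other field (journal, subject, funder) simply by changing the `field` argument, which is the point of metadata-based statistics.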
By the time of the conference, functionality that addresses this need will be available as part of @mire’s Content and Usage analysis module. The metadata based usage statistics were realized in co-development with the World Bank.
Presentation created by Lieven Droogmans, Art Lowel and Ignace Deroost.
Presented at Open Repositories 2015 by Ignace Deroost.
This presentation targets HDF5 application developers and anyone who is interested in the new HDF5 Library features. The following new features available in 1.8.0 will be discussed:
HDF5 cache
Metadata working set size is highly variable depending on file structure and access pattern. If the cache is too small, performance will deteriorate. In 1.8 we introduce code to configure the metadata cache size automatically, and API calls to allow manual configuration of the metadata cache.
Text to data type conversion
The new high-level API function, H5LTtext_to_dtype, provides the ability to create a data type from a text description of the data type. The function H5LTdtype_to_text facilitates debugging by printing the text description of a data type. The currently supported text description format is DDL.
External Links
This feature allows links in a group to refer to objects in another file, and for the library to access those objects as if they are in the current file. We will present the API functions and how external links are supported.
Group revisions
We will introduce new features of the HDF5 Group object, including compact group storage, new large group storage, intermediate group creation and Unicode support for HDF5 object names and datatypes. We will also cover new APIs for copying HDF5 objects between HDF5 files.
Compact Groups – This feature allows groups containing only a few links to take up much less space in the file.
New Large Group Storage – The method of storing groups with many links has been updated to be faster and more scalable.
Intermediate Group Creation – This feature allows intermediate groups that don't exist yet to be created when creating an object in a file.
Support for Unicode Character Set – The UTF-8 Unicode encoding is now supported for strings in datasets, the names of links and the names of attributes.
Similar to Integrating Hadoop in Your Existing DW and BI Environment (20)
Cloudera Data Impact Awards 2021 - Finalists – Cloudera, Inc.
This annual program recognizes organizations that are moving swiftly towards the future and building innovative solutions by making what was impossible yesterday, possible today.
The winning organizations' implementations demonstrate outstanding achievements in fulfilling their mission, technical advancement, and overall impact.
The 2021 Data Impact Awards recognize organizations' achievements with the Cloudera Data Platform in seven categories:
Data Lifecycle Connection
Data for Enterprise AI
Cloud Innovation
Security & Governance Leadership
People First
Data for Good
Industry Transformation
2020 Cloudera Data Impact Awards Finalists – Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most cutting-edge data projects and represent innovation and leadership in their respective industries.
Machine Learning with Limited Labeled Data 4/3/19 – Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 – Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to ad-hoc exploration of all data to optimize business processes, and on to the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 – Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
- Powerful data ingestion powered by Apache NiFi
- Edge data collection by Apache MiNiFi
- IoT-scale streaming data processing with Apache Kafka
- Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 – Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 – Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 – Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to the platform alone; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
4. Presentation Outline
• 1. The standard model
• 2. The 3 stages of Hadoop adoption
• 3. Cloudera partnerships
• 4. Analytics at eBay
• Questions and Discussion
Wednesday, November 17, 2010
5. 1. The Standard Model
Data Warehousing and Business Intelligence
13. Stage 1: Copy or Archive
[Diagram: application databases and application requests feed the data warehouse via ETL; business intelligence and analytics run on the warehouse, with Hadoop alongside as the copy/archive target.]
14. Stage 1: Add Unstructured Data
[Diagram: the same architecture — application databases and application requests feed the data warehouse via ETL for business intelligence and analytics — with Hadoop alongside the warehouse now also capturing unstructured data.]
15. Stage 1: Consolidate Multiple Data Warehouses
[Diagram: two application databases, each with its own ETL flow and data warehouse, feed into a single Hadoop cluster that consolidates the warehouses.]
16. Stage 2
On the Critical Path
17. Stage 2: Structure and Store
[Diagram: applications now land data in Hadoop first, which structures and stores it before loading the data warehouse for business intelligence and analytics; the ETL layer no longer appears.]
18. Stage 3
Ad Hoc Query Support
22. Cloudera Partnerships
Cloud, Hardware, and OS
• Processor: AMD, Intel
• Server: Acer, HP, Supermicro
• OS: Canonical
• Cloud: VMware vCloud
• CDH runs on AWS and Rackspace Cloud as well
27. eBay’s Data Scale
• eBay manages …
• Over 90 million active users worldwide
• Over 220 million items for sale
• Over 10 billion URL requests per day
• … in a dynamic environment
• Tens of new features each week
• Roughly 10% of items are listed or ended every day
• Collect Everything
• eBay processes 40TB of new, incremental data per day
• eBay analyzes 40PB of data per day
• Store every historical item and purchase
eBay has one of the largest EDW systems and is building one of the world’s largest Hadoop clusters
30. Data Sourcing Patterns

Source: Click Stream (Session, Event, Session Container)
• Preparation format: Session/Event data streamed as Gzip/Binary, prepared as LZO/Text. Session Container — a join of Session and corresponding Event data — prepared as Sequence Files.
• Pattern / Learning: For Session/Event data, build an index and use LzoTextInputFormat for splits. For the Session Container, use a secondary sort with a reduce-side join.

Source: EDW (Item, Transaction, User, Feedback, Bids)
• Preparation format: Incremental feed streamed and maintained as GZIP/Text. Smaller data sets are kept in the original format. A snapshot is prepared as a SequenceFile.
• Pattern / Learning: Rebuild the daily snapshot from the previous snapshot plus the incremental day’s data. Build a Hive table on the snapshot data by creating an external Hive table that points to the SequenceFile.

Source: HBase
• Pattern: a) Leverage TotalOrderPartitioner with RandomSampler to identify partition ranges for reducers. b) Create HBase regions using HFile. c) Update RegionServers using the ruby script loadtable.rb.
• Learning: a) Incremental data is not temporal/sparse, hence not suitable as versions in a column-oriented DB. b) HBase insert vs. append performance: 120K vs. 12K rows per sec. c) HFile flush durability issues (HBASE-1923).
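The EDW pattern above — rebuilding each day's full snapshot from the previous snapshot plus the day's incremental feed — can be sketched in plain Python. This is a hypothetical simplification for illustration only; in the actual pipeline this merge would run as a MapReduce job over SequenceFiles, and the record keys and fields shown are invented.

```python
def rebuild_snapshot(previous, incremental):
    """Merge an incremental day's records into the previous full snapshot.

    Both arguments map a record key (e.g. an item id) to its latest row;
    rows in the incremental feed overwrite the prior day's versions,
    and new keys are added. The previous snapshot is left untouched.
    """
    snapshot = dict(previous)      # start from yesterday's full snapshot
    snapshot.update(incremental)   # apply today's inserts and updates
    return snapshot

# Day 1 snapshot plus day 2's incremental feed (hypothetical records)
day1 = {"item-1": {"price": 10}, "item-2": {"price": 25}}
delta = {"item-2": {"price": 30}, "item-3": {"price": 5}}

day2 = rebuild_snapshot(day1, delta)  # item-2 updated, item-3 added
```

The same overwrite-by-key semantics is what the reduce side of the snapshot job would implement, with the reducer picking the newest version of each key.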
31. Hadoop Ecosystem
Hadoop Core (HDFS, Common)
MapReduce (Java, Streaming, Pipes, Scala)
Data Access (HBase, Pig, Hive)
Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
Monitoring & Alerting (Ganglia, Nagios)
• MapReduce
Sourcing data primarily in Java
Applications using Perl, Scala, Python…
• Data Access Frameworks
Pig – data pipelines
Hive – ad hoc queries
MQL – Mobius Query Language
• Monitoring & Alerting
Ganglia, Nagios, Cloudera Enterprise
• Tools & Libraries
HUE/Mobius – lifecycle of user jobs
UC4 – scheduling
Oozie – user workflow and data pipelines
Mahout – data mining
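For the Streaming interface listed above, a job is just a pair of programs that read lines and emit tab-separated key/value pairs; Hadoop sorts the map output by key before the reduce phase. The sketch below is a hypothetical word-count pair in Python — not one of eBay's actual jobs — driven locally as plain functions rather than over stdin/stdout for clarity.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in every input line.
    # Under Hadoop Streaming this would read stdin and print "word\t1".
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Reduce phase: Hadoop delivers map output sorted by key, so all
    # counts for a word are adjacent; sum each group. sorted() here
    # stands in for the framework's shuffle-and-sort step.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["hadoop streaming", "hadoop pipes"])))
```

The same two functions, wrapped to read `sys.stdin` and print tab-separated output, could be passed to the streaming jar as the `-mapper` and `-reducer` programs.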