In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate, since both Hadoop and Teradata® systems run in parallel, scale to enormous data volumes, and have shared-nothing architectures. At a conceptual level it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on those differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use an MPP relational database in a data warehouse, discovery platform, or other workload-specific application.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing at Teradata, and Jim Walker of Hortonworks, will examine how big data technologies are being used by practitioners today.
Trending use cases have pointed out the complementary nature of Hadoop and existing data management systems—emphasizing the importance of leveraging SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve distributed analytic processing. Many vendors have provided interfaces between SQL systems and Hadoop but have not been able to semantically integrate these technologies while Hive, Pig and SQL processing islands proliferate. This session will discuss how Teradata is working with Hortonworks to optimize the use of Hadoop within the Teradata Analytical Ecosystem to ingest, store, and refine new data types, as well as exciting new developments to bridge the gap between Hadoop and SQL to unlock deeper insights from data in Hadoop. The use of Teradata Aster as a tightly integrated SQL-MapReduce® Discovery Platform for Hadoop environments will also be discussed.
TM Forum Webinar - Telco API-driven digital marketplace opportunities | Post-... (ShubaS4)
If you missed the live webinar, you can catch all the details in this presentation. Expert speakers Karthik TS and Dean Ramsay discussed CSP strategies for a new breed of marketplaces in this on-demand webinar. This slide deck provides a comprehensive overview of the live webinar and is a great resource for CSPs looking for out-of-the-box API-driven digital marketplace solutions.
Informatica provides the market's leading data integration platform. Tested on nearly 500,000 combinations of platforms and applications, the platform interoperates with the broadest possible range of disparate standards, systems, and applications. This unbiased and universal view makes Informatica unique in today's market as the leader in data integration platforms. It also makes Informatica the ideal strategic platform for companies looking to solve data integration issues of any size.
By Zhamak Dehghani
Principal Consultant
ThoughtWorks
https://martinfowler.com/articles/data-monolith-to-mesh.html
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next-generation data lake in the hope of democratizing data at scale to provide business insights and, ultimately, make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes, we need to shift from the centralized paradigm of a lake, or its predecessor the data warehouse, to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Composable data for the composable enterprise (Matt McLarty)
I gave this talk at API Days Australia on September 15, 2021. It explores the intersection of the OLTP and OLAP worlds, and the role APIs play in bridging them. This talk introduces API-led Data Connectivity (ALDC).
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the fourth employee and VP of Product at MapR, a pioneer in Big Data analytics. He also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that served millions of users. He holds a Master's in computer engineering from Carnegie Mellon University and a Bachelor of Science in computer science from the Technion - Israel Institute of Technology.
The Modern Data Stack meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an "open" data platform built on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms generate. He also delivers features that transform data management, through initiatives such as Nessie, which opens the road to Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran "carte blanche" to share his experience and his vision of the Open Data Lakehouse.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... (Databricks)
Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Many ML Platforms cover data collection, feature engineering, training, deploying, productionalization, and monitoring but few, if any, do all of the above seamlessly.
Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python and Spark and can be used in modular pieces as each ML problem presents unique challenges. Through standardization of the path to production, training environments and the methods for collecting and transforming data on Spark, each model is reproducible and iterable.
This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, and we have a variety of models running in production. We have seen overall model development time go down from many months to days on Bighead. We plan to open source Bighead to allow the wider community to benefit from our work.
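As a loose, hypothetical sketch of that "modular pieces" idea (this is not Bighead's actual API; the Step and Pipeline names here are invented for illustration), a standardized pipeline can make every run follow the same reproducible path:

```python
# Hypothetical sketch, not Bighead's API: swappable steps compose into a
# standardized pipeline, so each run follows the same reproducible path.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Step:
    name: str
    fn: Callable[[list[float]], list[float]]

@dataclass(frozen=True)
class Pipeline:
    steps: Sequence[Step]

    def run(self, data: list[float]) -> list[float]:
        # Each ML problem plugs in only the pieces it needs.
        for step in self.steps:
            data = step.fn(data)
        return data

normalize = Step("normalize", lambda xs: [x / max(xs) for x in xs])
clip = Step("clip", lambda xs: [min(x, 0.9) for x in xs])

pipeline = Pipeline([normalize, clip])
print(pipeline.run([2.0, 4.0, 8.0]))  # [0.25, 0.5, 0.9]
```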
Data Lake Architecture – Modern Strategies & Approaches (DATAVERSITY)
Data Lake or Data Swamp? By now, we’ve likely all heard the comparison. Data Lake architectures have the opportunity to provide the ability to integrate vast amounts of disparate data across the organization for strategic business analytic value. But without a proper architecture and metadata management strategy in place, a Data Lake can quickly devolve into a swamp of information that is difficult to understand. This webinar will offer practical strategies to architect and manage your Data Lake in a way that optimizes its success.
The data architecture of solutions is frequently not given the attention it deserves or needs: too little attention is paid to designing and specifying the data architecture within individual solutions and their constituent components. This is due to the behaviours of both solution architects and data architects.
Solution architecture tends to concern itself with the functional, technology, and software components of the solution, and frequently omits the detail of the data aspects, leaving a solution data architecture gap. Data architecture, for its part, tends not to get involved with the data aspects of technology solutions, leaving a data architecture gap. Together, these gaps result in a data blind spot for the organisation.
Data architecture also tends to concern itself with data only after individual solutions are delivered. It needs to shift left into the domain of solutions and their data, engaging more actively with the data dimensions of individual solutions. Data architecture can take the lead in closing these gaps by shifting its scope and activities left, and by providing standards and common data tooling for solution data architecture.
The objective of data design for solutions is the same as that for overall solution design:
• To capture sufficient information to enable the solution design to be implemented
• To unambiguously define the data requirements of the solution and to confirm and agree those requirements with the target solution consumers
• To ensure that the implemented solution meets the requirements of the solution consumers and that no deviations have taken place during the solution implementation journey
Good solution data architecture avoids problems with solution operation and use, such as:
• Poor and inconsistent data quality
• Poor performance, throughput, response times and scalability
• Poorly designed data structures that cause long update and response times, hurting solution usability and productivity and driving transaction abandonment
• Poor reporting and analysis
• Poor data integration
• Poor solution serviceability and maintainability
• Manual workarounds for data integration, data extract for reporting and analysis
Data-design-related solution problems frequently become evident only after the solution goes live, so the benefits of solution data architecture are not always evident initially.
This is Part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies, and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
Unleashing the Power of OpenAI GPT-3 in FME Data Integration Workflows (Safe Software)
Join us for an eye-opening webinar where we will demonstrate the incredible power and productivity of OpenAI GPT-3 in FME data integration scenarios. From natural language processing to automated workflow generation and predictive modeling, we will show how GPT-3 can tackle even the most complex and daunting data integration challenges without a single line of code. This is a must-attend event for anyone looking to unlock the full potential of their data and streamline their integration workflows. Be prepared to be amazed and astounded at the capabilities of GPT-3 and FME!
Data Architecture - The Foundation for Enterprise Architecture and Governance (DATAVERSITY)
Organizations are faced with an increasingly complex data landscape, finding themselves unable to cope with exponentially increasing data volumes, compounded by additional regulatory requirements with increased fines for non-compliance. Enterprise architecture and data governance are often discussed at length, but often with different stakeholder audiences. This can result in complementary and sometimes conflicting initiatives rather than a focused, integrated approach. Data governance requires a solid data architecture foundation in order to support the pillars of enterprise architecture. In this session, IDERA’s Ron Huizenga will discuss a practical, integrated approach to effectively understand, define, and implement a cohesive enterprise architecture and data governance discipline with integrated modeling and metadata management.
FinOps Data - FR - by Matthieu Rousseau & Ismael Goulani
Matthieu Rousseau, CEO & Data Engineer, Modeo.
Ismael Goulani, CTO & Data Engineer, Modeo.
A look back at the first prize in the "Innovative Solution" category of the #LaNuitdelaData challenge, won with their solution Stach, a platform that helps data teams better understand how data is used by consumers, what it costs, and its carbon impact.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Dremio: a simple, high-performance architecture for your data lakehouse.
In the data world, Dremio defies classification! It is at once a data delivery platform, a powerful SQL engine built on Apache Arrow, Apache Calcite, and Apache Parquet, an active data catalog, and an open Data Lakehouse! After getting acquainted with the platform, we will look at how Dremio helps organizations meet their data management and governance challenges, making it easier to run their analytics in the cloud (and/or on premises) without the cost, complexity, and lock-in of data warehouses.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices-based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products; previously, Jeff was an independent architect for the US Defense Department, VP of Technology at Cerebra, and CTO of Modulant. He has been engineering artificial intelligence-based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and “Adaptive Information,” a frequent keynote speaker at industry conferences, an author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process, and enterprise architecture.
Building End-to-End Delta Pipelines on GCP (Databricks)
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build the Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
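As a hedged illustration of that medallion pattern (the GCS paths, column names, and configuration below are assumptions, not the session's actual notebooks), a minimal PySpark sketch might look like this:

```python
# A minimal sketch of a Bronze-Silver-Gold Delta flow, assuming a Spark
# environment with the delta-spark package and hypothetical GCS paths.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("delta-medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: land raw events as-is.
raw = spark.read.json("gs://example-bucket/raw/events/")  # hypothetical path
raw.write.format("delta").mode("append").save("gs://example-bucket/bronze/events")

# Silver: cleanse and deduplicate.
bronze = spark.read.format("delta").load("gs://example-bucket/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("event_ts").isNotNull())
silver.write.format("delta").mode("overwrite").save("gs://example-bucket/silver/events")

# Gold: aggregate for consumption (e.g., read externally via a BigQuery reader).
gold = silver.groupBy("event_date").agg(F.count("*").alias("events"))
gold.write.format("delta").mode("overwrite").save("gs://example-bucket/gold/daily_counts")
```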
Self-service analytics @ Leaseplan Digital: from business intelligence to int... (webwinkelvakdag)
Our mission is to drive digital data intelligence in a 55-year-old company currently undergoing digital transformation.
We do this through cloud big data architecture and intuitive business performance visualizations based on multiple data sources across customer journeys. Join this session to find out how we are enabling enterprise-wide adoption of self-service analytics, both internally as the single source of truth for business performance and as an embedded analytics solution that lets end customers steer real-time vehicle maintenance through predictive models.
In this session we will share our challenges, learnings, achievements and roadmap to embed self-service analytics in LeasePlan.
The Importance of DataOps in a Multi-Cloud World (DATAVERSITY)
There’s no denying that Cloud has evolved from being an outlying market disruptor to a mainstream method for delivering IT applications and services. In fact, it’s not uncommon to find that Enterprises use the services of more than one cloud at the same time. However, while a multi-cloud strategy offers many benefits, it also increases data management complexity and consequently reduces data availability. This webinar defines the meaning of DataOps and why it’s a crucial component for every multi-cloud approach.
Streamline Data Governance with Egeria: The Industry's First Open Metadata St... (DataWorks Summit)
Learn about the industry's new open metadata standard, Egeria, introduced in September by ODPi, The Linux Foundation’s Open Data Platform initiative. Egeria supports the free flow of standardized metadata between different technologies and vendor platforms, enabling organizations to locate, manage, and use their data resources more effectively. Explore how Egeria's set of open APIs, types, and interchange protocols allows all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks for automating the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.
This presentation by ODPi Director John Mertic provides an introduction to Egeria and explores how the standard provides a vendor-neutral approach to data governance. Learn how a group of companies led by ING, IBM, and Hortonworks came together through the open source community to re-imagine data governance and deliver Egeria, automating the collection, management, and use of metadata across organizations of any size and complexity. Learn how Egeria was built on open standards and delivered via the Apache 2.0 open source license.
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals (Cloudera, Inc.)
The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices actually can be implemented in Hadoop.
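As one hedged illustration of adapting such practices to Hadoop (the table and column names are assumptions, and this is not necessarily how the speakers implement it), a Type 2 slowly changing dimension merge can be sketched in PySpark:

```python
# A minimal sketch of a Type 2 SCD merge on Hadoop, under assumed table and
# column names: expire changed rows, append fresh versions, keep the rest.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").enableHiveSupport().getOrCreate()

dim = spark.table("dw.customer_dim")       # customer_id, address, valid_from, valid_to, is_current
upd = spark.table("staging.customer_upd")  # customer_id, address, load_date

# Pair current dimension rows with incoming records whose tracked attribute changed.
joined = (dim.alias("d")
             .join(upd.alias("u"), F.col("d.customer_id") == F.col("u.customer_id"))
             .where(F.col("d.is_current") & (F.col("d.address") != F.col("u.address"))))

# Close out the superseded versions.
expired = joined.select(
    F.col("d.customer_id").alias("customer_id"),
    F.col("d.address").alias("address"),
    F.col("d.valid_from").alias("valid_from"),
    F.col("u.load_date").alias("valid_to"),
    F.lit(False).alias("is_current"),
)

# Open new current versions carrying the changed attribute.
fresh = joined.select(
    F.col("u.customer_id").alias("customer_id"),
    F.col("u.address").alias("address"),
    F.col("u.load_date").alias("valid_from"),
    F.lit(None).cast("date").alias("valid_to"),
    F.lit(True).alias("is_current"),
)

# Keep every dimension row that was not superseded, then union the new history.
untouched = dim.join(expired.select("customer_id", "valid_from"),
                     ["customer_id", "valid_from"], "left_anti")
result = untouched.unionByName(expired).unionByName(fresh)
result.write.mode("overwrite").saveAsTable("dw.customer_dim_v2")
```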
Presentation at Data Summit 2015 in NYC.
Elliott Cordo shared real-world insights across a range of topics, including the evolving best practices for building a data warehouse on Hadoop that also coexists with multiple processing frameworks and additional non-Hadoop storage platforms, the place for massively parallel-processing and relational databases in analytic architectures, and the ways in which the cloud offers the ability to quickly and cost-effectively establish a scalable platform for your Big Data warehouse.
For more information, visit www.casertaconcepts.com
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...Kai Wähner
I discuss a sound big data architecture that combines Data Warehouse / Business Intelligence, Apache Hadoop, and Real Time / Stream Processing. Several real-world examples are shown. TIBCO offers some very nice products for realizing these use cases, e.g. Spotfire (Business Intelligence / BI), StreamBase (Stream Processing), BusinessEvents (Complex Event Processing / CEP), and BusinessWorks (Integration / ESB). TIBCO is also ready for Hadoop, offering connectors and plugins for many important Hadoop frameworks and interfaces such as HDFS, Pig, Hive, Impala, Apache Flume, and more.
In this presentation, Scott Gnau from Teradata Labs presents: Teradata Intelligent Memory.
“The introduction of Teradata Intelligent Memory allows our customers to exploit the performance of memory within Teradata Platforms, which extends our leadership position as the best performing data warehouse technology at the most competitive price,” said Scott Gnau, president, Teradata Labs. “Teradata Intelligent Memory technology is built into the data warehouse and customers don’t have to buy a separate appliance. Additionally, Teradata enables its customers to buy and configure the exact amount of in-memory capability needed for critical workloads. It is unnecessary and impractical to keep all data in memory, because all data do not have the same value to justify being placed in expensive memory.”
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan... (Cloudera, Inc.)
Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.
BSI Teradata: The Shocking Case of Home Electronics Planet (Teradata)
Home Electronics Planet, a big-box retailer, has digital marketing campaigns that are failing. Their Chief Marketing Officer gets some analytics and data science help from Business Scenario Investigators who recommend changing their search keywords mix, creating tighter customer segments based on product purchase sequencing coupled with real-time web page personalizations, and revising their e-mail marketing to improve business results.
This talk was given by Marcel Kornacker at the 11th meeting, on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
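For a flavor of such BI-style analysis from Python, here is a minimal sketch using the impyla client; the host, port, and table names are illustrative assumptions:

```python
# A minimal sketch of an interactive BI-style query against Impala via impyla;
# the daemon host, port, and schema are assumptions for illustration.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# A typical interactive query: join fact and dimension tables, then aggregate.
cur.execute("""
    SELECT d.region,
           COUNT(*)      AS orders,
           SUM(f.amount) AS revenue
    FROM   sales_fact f
    JOIN   store_dim  d ON f.store_id = d.store_id
    WHERE  f.sale_date >= '2014-01-01'
    GROUP  BY d.region
    ORDER  BY revenue DESC
""")
for region, orders, revenue in cur.fetchall():
    print(region, orders, revenue)
cur.close()
conn.close()
```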
Hadoop Integration into Data Warehousing Architectures (Humza Naseer)
This presentation explains research work on the topic of 'Hadoop integration into data warehouse architectures'. It explains where Hadoop fits into data warehouse architecture. Furthermore, it proposes a BI assessment model to determine the capability of a current BI program and how to define a roadmap for its maturity.
Complement Your Existing Data Warehouse with Big Data & Hadoop (Datameer)
To view the full webinar, please go to: http://info.datameer.com/Slideshare-Complement-Your-Existing-EDW-with-Hadoop-OnDemand.html
With 40% yearly growth in data volumes, traditional data warehouses have become increasingly expensive and challenging.
Many of today’s new data sources are unstructured, making the structured data warehouse an unsuitable platform for these analyses. As a result, organizations now look at Hadoop as a data platform to complement existing BI data warehouses, and as a scalable, flexible, and cost-effective solution for data storage and analysis.
Join Datameer and Cloudera in this webinar to discuss how Hadoop and big data analytics can help to:
-Get all the data your business needs quickly into one environment
-Shorten the time to insight from months to days
-Extend the life of your existing data warehouse investments
-Enable your business analysts to ask and answer bigger questions
Build a Big Data Warehouse on the Cloud in 30 Minutes (Caserta)
Elliott Cordo, Chief Architect at Caserta Concepts, will give a live demo using Amazon's AWS to build a Big Data Warehouse using S3 for data storage, Elastic MapReduce (EMR) for data manipulation, and Redshift for interactive queries; a rough sketch of the pattern follows the link below.
For more information, visit http://www.casertaconcepts.com/.
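Here is a rough sketch of that S3 / EMR / Redshift pattern, assuming hypothetical bucket and table names; it is an outline of the approach, not the demo's actual code:

```python
# Sketch of the pattern: transform raw files with Spark on EMR, write the
# result back to S3, then load it into Redshift with COPY for fast queries.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-emr-redshift-sketch").getOrCreate()

# Transform raw clickstream JSON on EMR and land curated CSV back in S3.
clicks = spark.read.json("s3://example-dw/raw/clicks/")
daily = clicks.groupBy(F.to_date("ts").alias("day")).count()
daily.write.mode("overwrite").csv("s3://example-dw/curated/daily_clicks/")

# Load the curated files into Redshift (run from any SQL client; the IAM
# role ARN is a placeholder).
COPY_SQL = """
COPY analytics.daily_clicks
FROM 's3://example-dw/curated/daily_clicks/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
DELIMITER ',';
"""
print(COPY_SQL)
```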
The Intelligent Thing -- Using In-Memory for Big Data and Beyond (Inside Analysis)
The Briefing Room with John O'Brien and Teradata
Live Webcast on June 11, 2013
http://www.insideanalysis.com
For traditional Data Warehousing and Big Data Analytics, research shows that a small percentage of enterprise data often comprises the lion's share of what's needed for queries. That's hot data, and organizations that know how to effectively harness that data can stay on top of what's happening. Conversely, cold data can certainly provide value at times, but should ideally be stored in ways that minimize cost. The more dynamically a company can manage this hot and cold data, the more efficient its information systems become.
Register for this episode of The Briefing Room to hear veteran database expert John O'Brien of Radiant Advisors as he outlines a strategy for managing hot and cold data. He'll be briefed by Alan Greenspan of Teradata, who will tout his company's Intelligent In-Memory solution, which optimizes the management of hot and cold data to keep analysts fueled with the data they need most. He'll also discuss Teradata Virtual Storage, which helps optimize the storage and provisioning of information assets.
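As a generic sketch of the hot/cold idea (a toy heuristic, not Teradata Intelligent Memory's actual placement logic), data temperature can be modeled as access frequency driving tier placement:

```python
# Toy sketch: route frequently accessed blocks to memory, cold blocks to
# cheaper storage. The threshold and block names are illustrative.
from collections import Counter

access_counts: Counter[str] = Counter()

def record_access(block_id: str) -> None:
    access_counts[block_id] += 1

def tier_for(block_id: str, hot_threshold: int = 100) -> str:
    """Place hot blocks in memory and the rest on disk."""
    return "memory" if access_counts[block_id] >= hot_threshold else "disk"

for _ in range(150):
    record_access("sales_2013_q2")   # a hot, recent table
record_access("sales_2009_q1")       # a cold, historical table

print(tier_for("sales_2013_q2"))  # memory
print(tier_for("sales_2009_q1"))  # disk
```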
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En... (MapR Technologies)
In this webinar, Carl W. Olofson, Research Vice President, Application Development and Deployment for IDC, and Dale Kim, Director of Industry Solutions for MapR, will provide an insightful outlook for Hadoop in 2015, and will outline why enterprises should consider using Hadoop as a "Decision Data Platform" and how it can function as a single platform for both online transaction processing (OLTP) and real-time analytics.
Creating a Next-Generation Big Data Architecture (Perficient, Inc.)
If you’ve spent time investigating Big Data, you quickly realize that the issues surrounding it are often complex to analyze and solve. The sheer volume, velocity, and variety change the way we think about data, including how enterprises approach data architecture.
Significant reduction in costs for processing, managing, and storing data, combined with the need for business agility and analytics, requires CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach to solve the complexities of Big Data.
Creating the data architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:
-Why Big Data capabilities must be strategically integrated into an enterprise’s data architecture
-How a next-generation architecture can be conceptualized
-The key components to a robust next generation architecture
-How to incrementally transition to a next generation data architecture
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures (Kangaroot)
Postgres is the leading open source database management system, developed by a very active community for more than 15 years.
Gaby Schilders, Sales Engineer at EnterpriseDB, supplier of the EDB Postgres data platform, will explain why companies make open source the centerpiece of modernising their IT infrastructure, increasing their scalability and taking full advantage of what today's technologies offer.
Transform your DBMS to drive engagement innovation with Big Data (Ashnikbiz)
Erik Baardse and Ajit Gadge from EDB Postgres presented on how to transform your DBMS to drive digital business: how Postgres enables you to support a wider range of workloads with your relational database, opening the door to Big Data. They also covered EnterpriseDB's strategy around Big Data, which focuses on three areas, and finally how to find money in IT with Big Data and digital transformation.
The ability to effectively analyze this kind of information is now seen as a key competitive advantage to better inform decisions. To do so, organizations employ Sentiment Analysis (SA) techniques on these data. However, the usage of social media around the world is ever-increasing, which considerably accelerates massive data generation and makes traditional SA systems unable to deliver useful insights. Such a volume of data can be efficiently analyzed using the combination of SA techniques and Big Data technologies. In fact, big data is not a luxury but an essential necessity for making valuable predictions. However, there are challenges associated with big data, such as quality, that can strongly affect the accuracy of SA systems built on huge volumes of data. The quality aspect should therefore be addressed in order to build reliable and credible systems. The goal of our research is thus to consider Big Data Quality Metrics (BDQM) in SA that relies on big data. In this paper, we first highlight the most relevant BDQM that should be considered throughout the Big Data Value Chain (BDVC) in any big data project. Then, we measure the impact of BDQM on the accuracy of a novel SA method in a real case study, presenting simulation results.
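As a toy sketch of how one such quality metric, completeness, might gate an SA pipeline (the fields and threshold below are assumptions for illustration, not the paper's method):

```python
# Toy sketch: measure record completeness before sentiment analysis, and gate
# the pipeline on it. Fields and the 0.8 threshold are assumptions.
from typing import Optional, TypedDict

class Post(TypedDict):
    text: Optional[str]
    lang: Optional[str]

def completeness(records: list[Post], fields=("text", "lang")) -> float:
    """Fraction of records where every required field is present and non-empty."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) for f in fields) for r in records)
    return ok / len(records)

posts: list[Post] = [
    {"text": "love this phone", "lang": "en"},
    {"text": None, "lang": "en"},              # incomplete record
    {"text": "terrible battery", "lang": None},
]

score = completeness(posts)
if score < 0.8:  # quality gate before running the SA model
    print(f"completeness {score:.2f}: clean or re-ingest before sentiment analysis")
```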
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload across clusters of servers, gives customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera, and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to:
-Solve big data problems with Hadoop
-Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data
-Implement Hadoop using the HDS Hadoop reference architecture
For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Data Lake Acceleration vs. Data Virtualization - What’s the difference? (Denodo)
Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution: a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Similar to Hadoop and the Data Warehouse: When to Use Which
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a short hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hr in). Basic knowledge of Python is highly recommended.
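As a preview of the kind of exercise described above, a minimal scikit-learn train-and-evaluate example on a built-in dataset might look like this (the model choice and split are illustrative, not the workshop's exact labs):

```python
# A minimal scikit-learn workflow: load a popular dataset, split it, train a
# classifier, and evaluate accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"test accuracy: {accuracy_score(y_test, preds):.3f}")
```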
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
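To make the WAL contract concrete, here is a conceptual Python sketch of the interface such a Log Service must honor; it is an illustration only, not the Ratis Log Service API:

```python
# Conceptual sketch of the WAL contract: appends are acknowledged only once
# durable, and recovery replays durable edits in order. Not the Ratis API.
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class LogService:
    _entries: list[bytes] = field(default_factory=list)
    _synced: int = 0  # index up to which entries are durable

    def append(self, edit: bytes) -> int:
        """Append a WAL edit; returns its sequence id."""
        self._entries.append(edit)
        return len(self._entries) - 1

    def sync(self) -> None:
        """Block until all appended edits are durable. In a RAFT-based log
        service this means replicated to a quorum of peers, rather than
        flushed to HDFS."""
        self._synced = len(self._entries)

    def replay(self) -> Iterator[bytes]:
        """Yield durable edits in order for region recovery."""
        yield from self._entries[: self._synced]

wal = LogService()
wal.append(b"put row1 cf:col=v1")
wal.sync()  # HBase acks the client only after durability is confirmed
for edit in wal.replay():
    print(edit)
```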
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option, since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert them into Phoenix SQL tables; a small query sketch follows the resource links below.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
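As a small hedged sketch of querying such a Phoenix table from Python, using the phoenixdb client through the Phoenix Query Server (the server URL and table schema are assumptions, not the talk's actual setup):

```python
# Minimal sketch: Phoenix speaks SQL over HBase, so BI-style queries work
# directly. The query server URL and crime-table columns are assumptions.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cur = conn.cursor()

# Count recent incidents per police district.
cur.execute(
    "SELECT dc_dist, COUNT(*) AS incidents "
    "FROM philly_crime "
    "WHERE dispatch_date >= ? "
    "GROUP BY dc_dist ORDER BY incidents DESC",
    ["2019-01-01"],
)
for district, incidents in cur.fetchall():
    print(district, incidents)
conn.close()
```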
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use today, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and the resolution action, drawn from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals and organizations want to adopt NoSQL technology but often lack an understanding of how its underlying functional bits can be applied to their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since its initial commit, Apache Accumulo has shipped a number of examples to jumpstart comprehension of how some of these bits function, and to help tease out how they might be applied to a NoSQL-friendly use case. One very relatable example demonstrates how Accumulo can be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to deliver its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
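If you want to experiment before the session, here is a library-free Python toy of the core trick: sorted forward and reversed indexes turn a single trailing or leading wildcard into a prefix scan, which is exactly the access pattern Accumulo ranges provide. The names and layout are simplified, not the actual dirlist code.

    import bisect

    files = ["alpha.txt", "alphabet.doc", "beta.txt", "report.txt"]

    forward = sorted(files)                    # serves queries like "alpha*"
    reverse = sorted(f[::-1] for f in files)   # serves queries like "*.txt"

    def prefix_scan(index, prefix):
        """All entries in a sorted index starting with prefix: the same
        access pattern as scanning an Accumulo Range over a table."""
        lo = bisect.bisect_left(index, prefix)
        hi = bisect.bisect_right(index, prefix + "\xff")
        return index[lo:hi]

    # Trailing wildcard: prefix scan on the forward index.
    print(prefix_scan(forward, "alpha"))
    # Leading wildcard: reverse the suffix, scan the reversed index,
    # then un-reverse the hits.
    print([r[::-1] for r in prefix_scan(reverse, ".txt"[::-1])])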
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data is the foundation of decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed for querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing for efficient reads means factoring in query patterns when partitioning data, to keep read amplification low. Organizing for efficient writes means factoring in the nature of the input data: whether it is append-only or updatable.
At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates as well as inserts. To ingest such datasets we need a component that is responsible for bookkeeping the data layout and that annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, duplicating data and breaking both data correctness and user queries. The component requires strong consistency and must sustain high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, which is critical to scaling our ingestion jobs to more than 500 billion writes a day. In this talk, we will discuss data at Uber, expound on why we built the global index on Apache HBase, and explain how it helps scale out our cluster usage. We'll detail why we chose HBase over other storage systems; how and why we devised a creative solution to load HFiles directly into the backend, circumventing the normal write path, when bootstrapping ingestion tables to avoid QPS constraints; and other lessons learned bringing this system to production at the scale of data Uber encounters daily.
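To illustrate the bookkeeping role such an index plays, here is a minimal sketch against HBase using the happybase client; the table layout and names are hypothetical, not Uber's actual schema.

    import happybase

    # Connect via the HBase Thrift gateway (host is a placeholder).
    conn = happybase.Connection('hbase-thrift-host')
    index = conn.table('record_index')  # hypothetical index table

    def locate(record_key):
        """Return the HDFS file currently holding this record, or None
        if the key is unknown (i.e., the change is a fresh insert)."""
        row = index.row(record_key, columns=[b'loc:file'])
        return row.get(b'loc:file')

    def record_location(record_key, hdfs_file):
        """Bookkeeping write once the record has landed in a file."""
        index.put(record_key, {b'loc:file': hdfs_file})

    # An incoming change is an update iff the index already knows its key.
    key = b'trip:9f3a17'
    print('update' if locate(key) else 'insert')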
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide high system throughput with very low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. Omid, in turn, has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
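For a sense of what a short Omid-backed transaction looks like from the SQL surface, here is a sketch using the phoenixdb Python adapter; the table is hypothetical and assumes a cluster whose Phoenix transaction provider is configured to be Omid.

    import phoenixdb

    conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=False)
    cur = conn.cursor()

    # TRANSACTIONAL=true routes this table's writes through the cluster's
    # transaction manager (Omid, when so configured).
    cur.execute(
        "CREATE TABLE IF NOT EXISTS AD_EVENTS ("
        " EVENT_ID VARCHAR PRIMARY KEY, CAMPAIGN VARCHAR, SPEND DECIMAL"
        ") TRANSACTIONAL=true")

    # Both upserts become visible atomically at commit, or not at all.
    cur.execute("UPSERT INTO AD_EVENTS VALUES ('e1', 'spring_sale', 0.25)")
    cur.execute("UPSERT INTO AD_EVENTS VALUES ('e2', 'spring_sale', 0.40)")
    conn.commit()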
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor given the variety of data sources that need to be collected and analyzed: application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more. In addition, multiple data formats need to be transformed and conformed so they can be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced cost-based optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss the best use cases for Presto across several industries, and present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
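As a flavor of the SQL-on-anything surface, here is a sketch using the presto-python-client package to join tables from two catalogs in a single query; the host, catalog, and table names are hypothetical.

    import prestodb

    conn = prestodb.dbapi.connect(
        host='presto-coordinator', port=8080,
        user='analyst', catalog='hive', schema='default',
    )
    cur = conn.cursor()

    # One query spanning a Hive table and a PostgreSQL table; the CBO needs
    # statistics from both connectors to pick a join strategy here.
    cur.execute("""
        SELECT o.region, count(*) AS orders
        FROM hive.web.orders o
        JOIN postgresql.public.customers c ON o.customer_id = c.id
        GROUP BY o.region
    """)
    print(cur.fetchall())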
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines, even if the person running the training makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. In a relatively short time, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects, and Models components with Azure Machine Learning (AML) Services and show how easy it is to get started with MLflow on-prem or in the cloud.
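Those "few lines of code" look like this in MLflow's Python API; the tracking server URI, experiment name, and values are placeholders.

    import mlflow

    mlflow.set_tracking_uri('http://mlflow-server:5000')  # placeholder URI
    mlflow.set_experiment('churn-model')

    with mlflow.start_run():
        mlflow.log_param('n_estimators', 200)   # a training parameter
        mlflow.log_metric('auc', 0.91)          # an evaluation metric
        mlflow.log_artifact('roc_curve.png')    # any local file becomes a run artifact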
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built from multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. The platform supports storage, compute, data ingestion, discovery, and management, plus various tools and libraries that help users with both batch and real-time analytics. It operates multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process, the challenges we faced supporting data analytics at Twitter scale in the cloud, and our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we examine in depth in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One challenge is securing data across hybrid environments while managing policies centrally and easily. In this session, we will discuss how companies can use Apache Ranger to protect access to data in on-premises as well as cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger solves them. We will also cover how companies can further enhance security by using Ranger to anonymize or tokenize data while moving it into the cloud, and to de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems, and wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
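As a sketch of the policy-as-API workflow this kind of demo builds on, here is a call to Ranger's public REST endpoint for creating a policy; the service, resource, and credential values are placeholders, and the field names follow Ranger's v2 policy model.

    import requests

    policy = {
        'service': 'cm_hive',                 # hypothetical Ranger service name
        'name': 'analysts_read_sales',
        'resources': {
            'database': {'values': ['sales']},
            'table':    {'values': ['orders']},
            'column':   {'values': ['*']},
        },
        'policyItems': [{
            'groups': ['analysts'],
            'accesses': [{'type': 'select', 'isAllowed': True}],
        }],
    }

    resp = requests.post(
        'https://ranger-admin:6182/service/public/v2/api/policy',
        json=policy,
        auth=('admin', 'admin'),  # placeholder credentials
        verify=False,             # demo only; verify TLS in real deployments
    )
    resp.raise_for_status()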
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCE v1, RoCE v2, iWARP, and Omni-Path. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) SSDs, these designs, along with the default big data processing models, need to be reassessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing big data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results show that NRCIO can achieve up to 3.65x performance improvement for representative big data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response times
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream identifying various storefront situations, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
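To make the shared building block behind these scenarios concrete, here is a minimal sketch of running an off-the-shelf detector on one camera frame using recent torchvision; the model choice and frame source are illustrative, not the talk's production stack.

    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    # Off-the-shelf detector; a production system would use a model tuned
    # for retail classes (shelf gaps, products, people).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
    model.eval()

    # One frame grabbed from a store camera (file name is a placeholder).
    frame = Image.open('shelf_camera_frame.jpg').convert('RGB')

    with torch.no_grad():
        detections = model([to_tensor(frame)])[0]

    # Keep confident detections; mapping them to shelf/stock/assistance
    # logic is where the application-specific work lives.
    for box, label, score in zip(detections['boxes'], detections['labels'],
                                 detections['scores']):
        if score > 0.8:
            print(int(label), float(score), [round(v, 1) for v in box.tolist()])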
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development; production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location; and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads by their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and the number of compute nodes, and it runs on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark is a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
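The Spark shape of the idea can be sketched in a few lines of PySpark: link reads that share k-mers so reads from the same molecule cluster together. SpaRC's actual algorithm is considerably more involved; this toy only shows the partitioning primitive.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('read-clustering-sketch').getOrCreate()
    sc = spark.sparkContext

    K = 5  # toy k-mer size; real pipelines use much larger k
    reads = sc.parallelize([
        ('read1', 'ACGTACGTAC'),
        ('read2', 'CGTACGTACG'),
        ('read3', 'TTTTGGGGCC'),
    ])

    def kmers(read_id, seq):
        return [(seq[i:i + K], read_id) for i in range(len(seq) - K + 1)]

    # k-mer -> set of reads containing it; shared k-mers are the edges
    # that join reads into clusters.
    edges = (reads.flatMap(lambda r: kmers(*r))
                  .groupByKey()
                  .mapValues(set)
                  .filter(lambda kv: len(kv[1]) > 1))

    print(edges.take(5))
    spark.stop()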
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we will discuss what cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
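For a taste of that binding, here is a short sketch assuming the pypowsybl package; treat the exact function names as illustrative of its documented API rather than a guaranteed snippet.

    import pypowsybl as pp

    network = pp.network.create_ieee14()   # bundled IEEE 14-bus test network
    results = pp.loadflow.run_ac(network)  # run an AC power flow
    print(results[0].status)               # convergence status of the main component
    print(network.get_buses()[['v_mag', 'v_angle']].head())  # solved bus states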
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
To Graph or Not to Graph: Knowledge Graph Architectures and LLMs
Hadoop and the Data Warehouse: When to Use Which
1. HADOOP & THE DATA WAREHOUSE: WHEN TO USE WHICH
Steve Wooledge – Teradata Labs
Jim Walker – Hortonworks
2. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
3. Big Data Comes with BIG HEADACHES
"Even free software like Hadoop is causing companies to spend more money… Many CIOs believe data is inexpensive because storage has become inexpensive. But data is inherently messy—it can be wrong, it can be duplicative, and it can be irrelevant—which means it requires handling, which is where the real expenses come in."
Source: The Wall Street Journal, "CIOs' Big Problem with Big Data", Aug 2012
"Through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage."
Source: Gartner, "Information Innovation: Innovation Key Initiative Overview", April 2012
4. Organizations Face Several Obstacles with Big Data
• Difficulty managing multiple systems, new types of data
• Hard to find the right skills; lack of supportability for new systems & "data scientists"
• Difficulty deploying and integrating new systems
• Difficulty providing accessibility to fast insights on big data
Source: Big Analytics 2012 Survey, Teradata
5. Shift from a Single Platform to an Ecosystem
"Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop."
"We will abandon the old models based on the desire to implement for high-value analytic applications."
"Logical" Data Warehouse
Source: "Big Data Comes of Age". EMA and 9sight Consulting. Nov 2012.
6. TERADATA UNIFIED DATA ARCHITECTURE
[Diagram: data sources (audio & video, images, text, web & social, machine logs, CRM, SCM, ERP) feed a Discovery Platform and an Integrated Data Warehouse (capture | store | refine), consumed through languages, math & stats, data mining, business intelligence, and applications by engineers, data scientists, business analysts, marketing, front-line workers, customers/partners, operational systems, and executives.]
7. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
8. The Value of The Data Warehouse
[Diagram: an Integrated Data Warehouse consolidating independent data marts, dual systems, an analytical archive, test/dev, and a data lab; serving business analysts, knowledge workers, marketing, executives, front-line workers, customers/partners, and operational systems through business intelligence, data mining, and applications. Integrated analytics span advanced analytics, temporal, OLAP, optimization, geospatial, big data integration, application development, agile analytics, and data exploration.]
Benefits:
• Easy to consume data
• Rationalization of data from multiple sources into a single enterprise view
• Clean, safe, secure data
• Cross-functional analysis
• Transform once, use many
• Fast response times
9. SQL Advantages with an MPP RDBMS
• Full ANSI SQL:
  - The lingua franca of business users when accessing data
  - Decades of standardization (stable, feature-rich, portable)
• Mature third-party SQL-based tools that give business users self-service, direct access to the data:
  - BI tools
  - In-database statistical packages
  - Analytic applications (CRM, SCM, MDM)
• Easily parallelized
• Scalable when manipulating large data sets
10. ACID Advantages in an MPP RDBMS
• Guarantees database actions are processed reliably
• Ensures 100% query result accuracy
• Supports updates and deletes
• Needed for applications that require 100% consistency
Atomicity - All of the pieces are committed or none are committed.
Consistency - Creates a new and valid state of data, or, if any failure occurs, returns all data to its original state.
Isolation - Processed and not-yet-committed transactions must remain isolated from any other transactions.
Durability - Committed data is saved such that, in the event of a failure and system restart, the data is available in its correct state.
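A compact, vendor-neutral illustration of atomicity in Python, using SQLite as a stand-in for any ACID-compliant RDBMS: a failed transfer rolls back both legs, so the total balance is never wrong.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)', [('a', 100), ('b', 0)])
    conn.commit()

    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'a'")
            # Simulate a business-rule failure mid-transaction.
            overdrawn = conn.execute(
                "SELECT balance < 0 FROM accounts WHERE id = 'a'").fetchone()[0]
            if overdrawn:
                raise ValueError('insufficient funds')
            conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'b'")
    except ValueError:
        pass

    # Both rows are untouched: the partial update was rolled back atomically.
    print(conn.execute('SELECT * FROM accounts ORDER BY id').fetchall())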
11. Tight Vertical Integration
• End-to-end management of resources
• Efficient utilization of resources
• Engineered extremely well for known data
• Fine-grained parallelism and resource management
• Consistency of service level delivery
Best Practices Management:
• Workload functions
• Workload groups
• Exceptions
• Priorities
• Time periods
12. Low Latency Advantages of MPP RDBMS
Multi-temperature storage with automated distribution of data based on access patterns:
• In-memory
• Solid-state drives
• Fast hard drives
• Fat hard drives
Complemented by indexes, statistics, and advanced partitioning.
13. Cost Based Optimizer Advantages in an MPP RDBMS
• A best-practices optimizer determines how the query will be processed most efficiently, with no "hints" or degrees of parallelism necessary.
• In chess, you can look a few moves ahead to decide your best next move, but you can't envision every move-and-countermove sequence for the entire game:
  - The Grand Master has the knowledge, experience, and intelligence to identify and use the right strategy.
  - With Hadoop, the user takes a heavy role in optimizing the execution of queries.
  - With an MPP RDBMS, the software is the optimizer.
There are many ways to process a complex query:
• Query rewrite: semantic optimization; different types of vendor tools
• Fast/efficient data access: access path (indexing); partitioning (CP & PPI); advanced partitioning schemes (range- and case-based, multilevel, dynamic); I/O optimizations (efficient scans/sync scan)
• Query complexity: join costing & planning; aggregation
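A small, vendor-neutral illustration of "the software is the optimizer": SQLite's planner below switches from a full scan to an index search with no user hints, the same principle an MPP optimizer applies at far greater scale.

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE sales (region TEXT, amount INTEGER)')

    query = "SELECT sum(amount) FROM sales WHERE region = 'west'"
    # Without an index, the planner scans the table.
    print(conn.execute('EXPLAIN QUERY PLAN ' + query).fetchall())

    conn.execute('CREATE INDEX idx_region ON sales (region)')
    # Same declarative query; the planner now searches via the index.
    print(conn.execute('EXPLAIN QUERY PLAN ' + query).fetchall())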
14. Granular Security Advantages in an MPP RDBMS
• Row-level security
• Column-level security
• An MPP RDBMS tightly integrates mature security features:
  - User-level security controls
  - Increased user authentication options
  - Support for security roles
  - Enterprise directory integration
  - Auditing and monitoring controls
  - Encryption
16. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
25. TERADATA UNIFIED DATA ARCHITECTURE
[Recap of the architecture diagram from slide 6.]
28. Organizations Face Several Obstacles with Big Data
[Recap of the four obstacles from slide 4. Source: Big Analytics 2012 Survey, Teradata]
29. Topics
• Trends in enterprise data architectures
• The value of an integrated data warehouse
• The value of Hadoop
• Bringing it all together and next steps
39. Teradata Portfolio for Hadoop
"Taking Hadoop from Silicon Valley to Main Street"
What we announced today:
• The most trusted & flexible Hadoop platforms for your next-generation Unified Data Architecture™:
  1. Teradata Aster Big Analytics Appliance
  2. Teradata Appliance for Hadoop
  3. Teradata Commodity Offering for Hadoop (Dell)
  4. Teradata Software-only for Hadoop (Hortonworks Data Platform)
• Complete consulting and training capability:
  - Big Analytics Services – across the UDA
  - Data Integration Optimization – ETL, ELT across the UDA
  - Hadoop deployment & mentoring
  - Teradata delivering Hortonworks training
  - Hadoop Managed Services – operations & administration
• Customer support for Hadoop:
  - World-class Teradata customer support, backed by Hortonworks
40. Teradata Appliance for Hadoop
Value-added software bringing Hadoop to the enterprise:
• Access: SQL-H™ (via HCatalog)
• Management: Viewpoint, TVI
• Administration: Hadoop Builder, intelligent start/stop, DataNode swap, deferred drive replacement
• High availability: NameNode HA, master machine failover
• Refining, metadata, entity resolution
• Security & data access: Kerberos
41. Complete Consulting and Training Capability
Post-sale services and areas of focus:
• Teradata Analytic Architecture Services – services to scope, design, build, operate and maintain an optimal UDA approach for Teradata, Aster, and Hadoop
• Teradata DI Optimization – assess structured/non-structured data, discuss data loading techniques, determine the best platform, optimize load scripts/processes
• Teradata Big Analytics – assess data value/cost of capture, identify sources of "exhaust" data, create a conceptual architecture, refine and enrich the data, implement initial analytics in Aster or the best-fit tool
• Teradata Workshop for Hadoop – introductory workshop (across all of the UDA)
• Teradata Data Staging for Hadoop – load data into a landing area; set up a data exploration/refining area; scope architecture and analytics; set up the Hadoop repository; load sample data
• Teradata Platform for Hadoop – installation guidance and mentoring for the Hadoop platform, do-it-yourself after installation
• Teradata Managed Services for Hadoop – operations, management, administration, backup, security, process control for Hadoop
• Teradata Training Courses for Hadoop – two comprehensive, multi-day training offerings: 1) Administration of Apache Hadoop and 2) Developing Solutions Using Apache Hadoop
42. When to Use Which?
The best approach by workload and data type: processing as a function of schema requirements and stage of the data pipeline.
Pipeline stages (columns): low-cost storage and fast loading | data pre-processing, refining, cleansing | "simple math at scale" (score, filter, sort, avg., count...) | joins, unions, aggregates | analytics (iterative and data mining) | reporting
• Stable schema: Teradata/Hadoop | Teradata | Teradata | Teradata | Teradata | Teradata
• Evolving schema: Hadoop | Aster/Hadoop | Aster/Hadoop | Aster | Aster (SQL + MapReduce analytics) | Aster
• Format, no schema: Hadoop | Hadoop | Hadoop | Aster | Aster (MapReduce analytics) | Aster
Example workloads:
• Financial Analysis, Ad-Hoc/OLAP
• Enterprise-Wide BI and Reporting
• Spatial/Temporal
• Active Execution
• Interactive Data Discovery
• Web Clickstream, Set-Top Box Analysis
• CDRs, Sensor Logs, JSON
• Social Feeds, Text, Image Processing
• Audio/Video Storage and Refining
• Storage and Batch Transformations