Learn more about Teradata's Extreme Data Appliance 1650 - http://www.teradata.com/t/extreme-data-appliance/
Also view the video: http://www.youtube.com/watch?v=iAEKsECBcyU
#BDAM: EDW Optimization with Hadoop and CDAP, by Sagar Kapare from Cask (Cask Data)
Speaker: Sagar Kapare, Cask
Big Data Applications Meetup, 05/10/2017
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to video: https://youtu.be/mSKwjKvYUtI
About the talk:
The cost of maintaining a traditional Enterprise Data Warehouse (EDW) is skyrocketing as legacy systems buckle under the weight of exponentially growing data and increasingly complex processing needs. Hadoop, with its massive horizontal scalability, and CDAP, which offers pre-built pipelines for EDW offload in a drag-and-drop studio environment, can help.
Sagar will demonstrate Cask’s solution, which shows how to build code-free, scalable, and enterprise-grade pipelines for delivering an easy-to-use and efficient EDW offload solution. He will also show how interactive data preparation, data pipeline automation, and fast querying capabilities over voluminous data can help unlock new use-cases.
United Airlines is leveraging big data at the enterprise level to help drive revenue, improve the customer experience, optimize operations, and support our employees in their day-to-day activities. At the center of our big data stack is Apache Hadoop, supported by many other emerging open source frameworks that must be integrated with the myriad of operational systems that support a 90-year-old transportation company with worldwide operations. In addition, learn how streaming data and streaming data analytics are helping to drive operational decisions in real time and how this is being architected to scale horizontally to take advantage of high availability and parallel processing. With the rapidly evolving Hadoop ecosystem, and so many new open source technologies at our disposal, the options for solving long-standing industry problems such as modeling how customers make decisions, making timely and meaningful real-time offers, and optimizing logistical operations have never been better. JOE OLSON, Senior Manager, Big Data Analytics, United Airlines and JONATHAN INGALLS, Sr. Solutions Engineer, Hortonworks
Bridging the gap: achieving fast data synchronization from SAP HANA by levera... (DataWorks Summit)
American Water will share the success story of its production use case of leveraging Hadoop and streaming to ingest and supply de-normalized data from source transactional systems to end-user applications. It covers the end-to-end flow and the challenges faced.
The data is de-normalized into single-subject views at the source to eliminate complex join logic during ingestion into the data lake. Within the views, only timestamps on highly volatile tables have been exposed to give visibility to updates and inserts that have occurred on a table. NiFi ingests the data with a custom processor and then stores it in ACID tables in Hive. The custom processor polls the timestamp columns, generating paginated queries that consist of the delta.
American Water’s use case: Our field employees are our front line with our customers and in the past have felt unable to help customers effectively with the technologies available to them. One of the largest initiatives is to enable our field employees with accurate and up-to-date information via a new application so they can provide a great customer experience.
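To make the ingestion approach concrete, the sketch below shows timestamp-driven delta polling with pagination in plain Java/JDBC. It is an illustration only, not American Water's actual processor: the view name, timestamp column, connection details, and page size are hypothetical, and the real implementation runs inside a custom NiFi processor feeding Hive ACID tables.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class DeltaPoller {
    // Hypothetical source view and timestamp column; the real names live in the source system.
    private static final String DELTA_QUERY =
            "SELECT * FROM customer_orders_view "
          + "WHERE last_modified_ts > ? AND last_modified_ts <= ? "
          + "ORDER BY last_modified_ts "
          + "LIMIT ? OFFSET ?";

    public static void main(String[] args) throws Exception {
        Timestamp lastRun = Timestamp.valueOf("2018-01-01 00:00:00"); // high-water mark from the previous poll
        Timestamp now = new Timestamp(System.currentTimeMillis());
        int pageSize = 10_000;

        // Placeholder JDBC URL; assumes a source database that supports LIMIT/OFFSET pagination.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://source-db:5432/crm", "user", "secret")) {
            int offset = 0;
            while (true) {
                int rowsInPage = 0;
                try (PreparedStatement ps = conn.prepareStatement(DELTA_QUERY)) {
                    ps.setTimestamp(1, lastRun);
                    ps.setTimestamp(2, now);
                    ps.setInt(3, pageSize);
                    ps.setInt(4, offset);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rowsInPage++;
                            // Hand each changed row to the downstream writer (e.g. an INSERT/MERGE into a Hive ACID table).
                        }
                    }
                }
                if (rowsInPage < pageSize) {
                    break; // last page of this delta window
                }
                offset += pageSize;
            }
        }
        // Persist 'now' as the new high-water mark so the next poll only picks up newer changes.
    }
}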
Speaker
John Kuchmek, American Water, Sr. Technologist
Adam Michalsky, American Water, Senior Technologist
From an experiment to a real production environment (DataWorks Summit)
Rabobank is a worldwide food- and agri-bank from the Netherlands. Rabobank wants to make a substantial contribution to welfare and prosperity in the Netherlands and to feeding the world sustainably. Rabobank Group operates through Rabobank and its subsidiaries in 40 countries.
Rabobank is active in both retail and wholesale banking. For our wholesale clients we provide real-time business insight by making use of Cloudera and Hortonworks technology. An example is our recently launched service that gives insight into the market performance of Rabobank customers, starting with the dairy farmers market segment, using benchmark information. Our current technology stack contains Hortonworks DataFlow (HDF) and Cloudera Hadoop (CDH). Our real-time data stream is implemented with Kafka and NiFi from HDF. Cloudera is used to store the data needed for the business insight information, mainly in HDFS and HBase.
During our presentation we will provide insight into the project approach, the architecture, and the actual implementation.
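As a rough illustration of the kind of streaming path described above, the hedged sketch below consumes events from a Kafka topic and writes them into an HBase table. The broker address, topic, table, and column family names are placeholders and do not reflect Rabobank's actual configuration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BenchmarkStreamWriter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder broker
        props.put("group.id", "benchmark-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("dairy_benchmark"))) { // hypothetical table

            consumer.subscribe(Collections.singletonList("market-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Row key = customer id (the record key); one column holds the raw event payload.
                    Put put = new Put(Bytes.toBytes(record.key()));
                    put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("event"), Bytes.toBytes(record.value()));
                    table.put(put);
                }
            }
        }
    }
}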
Speaker
Jeroen Wolffensperger, Solution Architect Data, Rabobank
Martijn Groen, Delivery Manager Data, Rabobank Netherlands
SQL is the most widely used language for data processing. It allows users to concisely and easily declare their business logic. Data analysts usually do not have complex software programming backgrounds, but they can write SQL and use it on a regular basis to analyze data and power business decisions. Apache Flink is one of the streaming engines that support SQL. Besides Flink, some other stream processing frameworks, like Kafka and Spark Structured Streaming, have SQL-like DSLs, but they do not have the same semantics as Flink. Flink’s SQL implementation follows the ANSI SQL standard while the others do not.
In this talk, we will present why following the ANSI SQL standard is an essential characteristic of Flink SQL and how we achieved this. The core business of Alibaba is now fully driven by the data processing engine Blink, a project based on Flink with Alibaba’s improvements. About 90% of Blink jobs are written in Flink SQL. We will also share the use cases and our experience of running large-scale Flink SQL jobs at Alibaba.
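For readers unfamiliar with Flink SQL, here is a minimal sketch (written against recent open-source Flink Table API versions, not the Blink fork described in the talk) that declares a Kafka-backed table and runs a standard windowed aggregation over it. The topic, fields, and connector options are illustrative assumptions.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkSqlExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Kafka-backed source table of order events.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  item STRING," +
            "  amount DOUBLE," +
            "  order_time TIMESTAMP(3)," +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'kafka-broker:9092'," +
            "  'format' = 'json'" +
            ")");

        // Standard ANSI-style SQL: revenue per item over one-minute tumbling event-time windows.
        tEnv.executeSql(
            "SELECT item, TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start, SUM(amount) AS revenue " +
            "FROM orders " +
            "GROUP BY item, TUMBLE(order_time, INTERVAL '1' MINUTE)").print();
    }
}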
Speakers
Shaoxuan Wang, Senior Engineering Manager, Alibaba
Xiaowei Jiang, Senior Director, Alibaba
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information about a vehicle, such as its location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context for understanding driving behaviors at a detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently. The raw sensor and GPS data must be properly pre-processed and transformed to extract the signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scale telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc.). We discuss common techniques to load and transform telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
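As a rough sketch of the load-and-transform step, the Java/Spark example below reads raw GPS pings, filters obvious glitches, and derives simple per-trip driving features. The schema, paths, and thresholds are hypothetical and stand in for whatever the speakers' actual pipeline uses.

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TelematicsFeatures {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("telematics-features").getOrCreate();

        // Hypothetical schema: trip_id, ts, lat, lon, speed_kmh, accel_ms2
        Dataset<Row> pings = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///data/telematics/pings/"); // placeholder path

        // Basic cleaning: drop GPS glitches (impossible speeds) before feature extraction.
        Dataset<Row> cleaned = pings.filter(col("speed_kmh").between(0, 250));

        // Per-trip driving-behavior features that a downstream safety model could consume.
        Dataset<Row> features = cleaned.groupBy("trip_id").agg(
                avg("speed_kmh").alias("avg_speed"),
                max("speed_kmh").alias("max_speed"),
                max("accel_ms2").alias("max_accel"));

        features.write().mode("overwrite").parquet("hdfs:///data/telematics/trip_features/");
        spark.stop();
    }
}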
Speakers
Yanwei Zhang, Senior Data Scientist II, Uber
Neil Parker, Senior Software Engineer, Uber
Kudu as Storage Layer to Digitize Credit ProcessesDataWorks Summit
With HDFS and HBase, there are two different storage options available in the Hadoop ecosystem. Both have their strengths and weaknesses. However, neither HDFS nor HBase can be used universally for all kinds of workloads. Usually this leads to complex hybrid architectures. Kudu is a very versatile storage layer that fills this gap and simplifies the architecture of Big Data systems.
A large German bank is using Kudu as a storage layer to speed up its credit processes. Within this system, financial transactions of millions of customers are analysed by Spark jobs to categorize transactions and to calculate key figures. In addition to this analytical workload, several frontend applications use the Kudu Java API to perform random reads and writes in real time.
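To illustrate what those random reads and writes can look like, here is a small, hedged sketch using the Kudu Java client. The master address, table name, and column names are invented for the example and do not reflect the bank's actual schema.

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class KuduTransactionStore {
    public static void main(String[] args) throws Exception {
        try (KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("transactions"); // hypothetical table name
            KuduSession session = client.newSession();

            // Random write: insert a single categorized transaction.
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addString("txn_id", "tx-42");
            row.addString("customer_id", "cust-7");
            row.addString("category", "groceries");
            row.addDouble("amount", 23.95);
            session.apply(insert);
            session.flush();

            // Random read: fetch all transactions for one customer.
            KuduScanner scanner = client.newScannerBuilder(table)
                    .addPredicate(KuduPredicate.newComparisonPredicate(
                            table.getSchema().getColumn("customer_id"),
                            KuduPredicate.ComparisonOp.EQUAL, "cust-7"))
                    .build();
            while (scanner.hasMoreRows()) {
                RowResultIterator results = scanner.nextRows();
                for (RowResult result : results) {
                    System.out.println(result.getString("category") + " " + result.getDouble("amount"));
                }
            }
        }
    }
}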
The presentation will cover these topics:
- Business and technical requirements
- Data access patterns
- System architecture
- Kudu data modelling
- Kudu architecture for High Availability
- Experiences from development and operations
Speaker: Olaf Hein, Department Head & Principal Consultant, ORDIX AG
Architecting Big Data Ingest & Manipulation (George Long)
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about:
o What it actually takes to get the data you need to add that one metric to your report/dashboard?
o What's it like to navigate the early conversations of an analytic solution?
o How is one technology selected over another and how do those selections impact or define other selections?
Achieving a 360-degree view of manufacturing via open source industrial data ... (DataWorks Summit)
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society for Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers is giving forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
Speakers
Michael Ger, General Manager Manufacturing and Automotive, Hortonworks
Wade Salazar, Solutions Engineer, Hortonworks
What’s New in Syncsort Integrate? New User Experience for Fast Data Onboarding (Precisely)
We are excited to announce the general availability of the intuitive graphical interface for DataFunnel™. This browser-based point-and-click interface gives you the ability to move hundreds of relational tables to a different RDBMS – or to Hadoop – in just minutes! Select the schema of tables you’d like to move, filter out any tables, columns or rows you’d like to exclude, and invoke – all with the click of a mouse – in a user-friendly wizard interface.
View this webinar on-demand, where we discussed the newest features in Syncsort DMX/DMX-h, DMX CDC and DataFunnel™. During this webinar, you will see a special sneak peek of some of the new exciting additions coming soon to the Syncsort data integration product family! Webinar key takeaways:
• Learn about the newest features in the Syncsort Integrate product family
• Get a sneak preview of interesting Integrate features coming soon
• See the new intuitive independent DataFunnel™ platform interface
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Data organization for efficient writing involves factoring in the nature of the input data – whether it is append-only or updatable.
At Uber, we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to existing data in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, which leads to duplication of data and breaks data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it must provide strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to automatically load HFiles directly into the backend – circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints – as well as other lessons learned bringing this system into production at the scale of data that Uber encounters daily.
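A minimal sketch of the index lookup described above, assuming a hypothetical HBase table keyed by record key that stores the HDFS file currently holding each record; Uber's production component is far more involved, so treat the table name, column family, and layout as illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GlobalIndexLookup {
    private static final byte[] CF = Bytes.toBytes("i");
    private static final byte[] LOCATION = Bytes.toBytes("file_path");

    // Returns the HDFS file that already holds this record key (an update),
    // or null if the key is unseen (an insert), and records the new location.
    public static String annotate(Table index, String recordKey, String newFilePath) throws Exception {
        Result existing = index.get(new Get(Bytes.toBytes(recordKey)));
        String previousLocation = existing.isEmpty()
                ? null
                : Bytes.toString(existing.getValue(CF, LOCATION));

        Put put = new Put(Bytes.toBytes(recordKey));
        put.addColumn(CF, LOCATION, Bytes.toBytes(newFilePath));
        index.put(put);
        return previousLocation;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("trips_global_index"))) { // hypothetical table
            String target = annotate(index, "trip-123", "hdfs:///datalake/trips/partition=2019-01-01/file-0007.parquet");
            System.out.println(target == null ? "insert" : "update of " + target);
        }
    }
}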
Reaching scale limits on a Hadoop platform: issues and errors created by spee... (DataWorks Summit)
Santander UK’s Big Data journey began in 2014, using Hadoop to make the most of our data and generate value for customers. Within 9 months, we created a highly available real-time customer facing application for customer analytics. We currently have 500 different people doing their own analysis and projects with this data, spanning a total of 50 different use cases. This data, (consisting of over 40 million customer records with billions of transactions), provides our business new insights that were inaccessible before.
Our business moves quickly, with several products and 20 use cases currently in production. We currently have a customer data lake and a technical data lake. Having a platform with very different workloads has proven to be challenging.
Our success in generating value created such growth in terms of data, use cases, analysts and usage patterns that 3 years later we find issues with scalability in HDFS, the Hive metastore and Hadoop operations, and challenges with highly available architectures involving HBase, Flume and Kafka. Going forward we are exploring alternative architectures, including a hybrid cloud model, and moving towards streaming.
Our goal with this session is to assist people in the early part of their journey by building a solid foundation. We hope that others can benefit from us sharing our experiences and lessons learned during our journey.
Speaker
Nicolette Bullivant, Head of Data Engineering at Santander UK Technology, Santander UK Technology
Which Change Data Capture Strategy is Right for You? (Precisely)
Change Data Capture or CDC is the practice of moving the changes made in an important transactional system to other systems, so that data is kept current and consistent across the enterprise. CDC keeps reporting and analytic systems working on the latest, most accurate data.
Many different CDC strategies exist. Each strategy has advantages and disadvantages. Some put an undue burden on the source database. They can cause queries or applications to become slow or even fail. Some bog down network bandwidth, or have big delays between change and replication.
Each business process has different requirements, as well. For some business needs, a replication delay of more than a second is too long. For others, a delay of less than 24 hours is excellent.
Which CDC strategy will match your business needs? How do you choose?
View this webcast on-demand to learn:
• Advantages and disadvantages of different CDC methods
• The replication latency your project requires
• How to keep data current in Big Data technologies like Hadoop
How to Avoid Disasters via Software-Defined Storage Replication & Site Recovery (DataCore Software)
Shifting weather patterns across the globe force us to re-evaluate data protection practices in locations we once thought immune from hurricanes, flooding and other natural disasters.
Offsite data replication combined with advanced site recovery methods should top your list.
In this webcast and live demo, you’ll learn about:
• Software-defined storage services that continuously replicate data, containers and virtual machine images over long distances
• Differences between secondary sites you own or rent vs. virtual destinations in public Clouds
• Techniques that help you test and fine tune recovery measures without disrupting production workloads
• Transferring responsibilities to the remote site
• Rapid restoration of normal operations at the primary facilities when conditions permit
Hadoop and NoSQL joining forces by Dale Kim of MapR (Data Con LA)
More and more organizations are turning to Hadoop and NoSQL to manage big data. In fact, many IT professionals consider each of those terms to be synonymous with big data. At the same time, these two technologies are seen as different beasts that handle different challenges. That means they are often deployed in a rather disjointed way, even when intended to solve the same overarching business problem. The emerging trend of “in-Hadoop databases” promises to narrow the deployment gap between them and enable new enterprise applications. In this talk, Dale will describe that integrated architecture and how customers have deployed it to benefit both the technical and the business teams.
In this presentation we look at the use of the Hortonworks DataFlow (HDF) platform in finance-sector operations. We describe how the distribution is deployed and used as part of operations, and how the different components are integrated to support real-time payments and banking functionality. We document the challenges faced in delivering a high-performance system that promotes data integrity and real-time visualisation using the Hortonworks Data Platform (HDP), focusing especially on the use of Apache Zeppelin notebooks across the business as the main information management tool. Our approach focuses on creating a flexible system that allows fast prototyping and integrated visualisation, monitoring and audit. We demonstrate the use of the HDF distribution to create business domain abstractions in a real-life application of Domain-Driven Design. We implement cross-platform use of Avro as a future-proof model and language that is understood by both the business and IT areas. Thanks to the scalability of the platform we can execute payment operations at a high rate, even across countries, while reusing the same architecture to monitor business operations.
Speaker
Luis Caldeira, Chief Architect, Orwell Group
Gian Marco Cabiato, Head of Engineering, Orwell Group
Lessons learned processing 70 billion data points a day using the hybrid cloud (DataWorks Summit)
NetApp receives 70 billion data points of telemetry information each day from its customers’ storage systems. This telemetry data contains configuration information, performance counters, and logs. All of this data is processed using multiple Hadoop clusters, and feeds a machine learning pipeline and a data serving infrastructure that produces insights for customers via an application called Active IQ. We describe the evolution of our Hadoop infrastructure from a traditional on-premises architecture to the hybrid cloud, and lessons learned.
We’ll discuss the insights we are able to produce for our customers, and the techniques used. Finally, we describe the data management challenges with our multi-petabyte Hadoop data lake. We solved these problems by building a unified data lake on-premises and using the NetApp Data Fabric to seamlessly connect to public clouds for data science and machine learning compute resources.
Architecting a truly hybrid cloud implementation allowed NetApp to free up our data scientists to use any software on any cloud, kept the customer log data safe on NetApp Private Storage in Equinix, enabled faster innovation and code releases, and provided the flexibility to use any public cloud while the data remains on NetApp storage in Equinix.
Speaker
Pranoop Erasani, NetApp, Senior Technical Director, ONTAP
Shankar Pasupathy, NetApp, Technical Director, ACE Engineering
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources of data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up-to-date information. In this talk, we will explore the different techniques available to keep track of the data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
Processing transactions is at the core of any bank’s business. Danske Bank’s journey started with recognising the value that could be gleaned from generating insights from that data to improve customer behaviour analytics. Today, the company streams large volumes of transactional data in near real time onto its Hortonworks Data Platform to improve fraud detection and customer marketing. In this session, Nadeem will outline the bank’s vision, how it was socialised across the executive board and the resulting sponsorship, the technological path, the challenges overcome, and the results, which have not only improved the customer experience but also delivered quantifiable reductions in fraud and opened new revenue streams. Furthermore, Nadeem will cover future use cases around maintenance and operations.
We’re living in an era of digital disruption, where the accessibility and adoption of emerging digital technologies are enabling enterprises to reimagine their businesses in exciting new ways. Data flows from the edge to the core to the cloud while performing analytics and gaining actionable intelligence at all steps along the way. This connected, automated and data-driven future enables organizations to rapidly acquire, analyze, and take action on real-time data as well as curate flows for additional analysis at a later stage. New IoT use cases require enterprises to properly handle data in motion and create newer edge applications with data flow management, stream processing and analytics while still being governed by existing enterprise services.
This session highlights the importance of an edge-to-core-to-cloud digital infrastructure that can adapt to your flexing business needs, capturing expanding data flows at the edge and aligning them to a core infrastructure that can drive insight.
Speakers
Bob Mumford, Hewlett Packard Enterprise, Big Data Solutions Architect
Teradata Technology Leadership and Innovation (Teradata)
Teradata is the world's leader in data warehousing and integrated marketing management through its database software, data warehouse appliances, and enterprise analytics. For more information, visit teradata.com.
The Most Trusted In-Memory Database in the World - Altibase (Altibase)
Life is a database. How you manage data defines your business. ALTIBASE HDB, with its hybrid architecture, combines the extreme speed of an in-memory database with the storage capacity of an on-disk database in a single unified engine.
ALTIBASE® HDB™ is the only Hybrid DBMS in the industry that combines an in-memory DBMS with an on-disk DBMS, with a single uniform interface, enabling real-time access to large volumes of data, while simplifying and revolutionizing data processing. ALTIBASE XDB is the world’s fastest in-memory DBMS, featuring unprecedented high performance, and supports SQL-99 standard for wide applicability.
Altibase is a provider of in-memory data solutions for real-time access, analysis and distribution of high volumes of data in mission-critical environments.
Please visit our website (www.altibase.com) to learn more about our products and read more about our case studies. Or contact us at info@altibase.com. We look forward to helping you!
WETEC is a professional company specialized in infrastructure projects in the field of high-end systems, storage and networking. http://www.wetec.nl/
Learn how Aerospike's Hybrid Memory Architecture brings transactions and analytics together to power real-time Systems of Engagement (SOEs) for companies across AdTech, financial services, telecommunications, and eCommerce. We take a deep dive into the architecture including use cases, topology, Smart Clients, XDR and more. Aerospike delivers predictable performance, high uptime and availability at the lowest total cost of ownership (TCO).
Fujitsu World Tour 2017 - Compute Platform For The Digital World (Fujitsu India)
Significant performance increase combined with a rich feature set based on cutting edge technology results in compelling benefits across a broad variety of application scenarios.
DDN: Massively-Scalable Platforms and Solutions Engineered for the Big Data a... (inside-BigData.com)
In this talk from the DDN User Group at ISC’13, James Coomer from DataDirect Networks presents: Massively-Scalable Platforms and Solutions Engineered for the Big Data and Cloud Era.
Watch the presentation here: http://insidehpc.com/2013/06/26/video-james-coomer-keynotes-ddn-user-group-at-isc13/
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics (sandor szabo)
The ability of Informix to combine the in-memory performance of Informix Warehouse Accelerator and the flexibility of TimeSeries and NoSQL analytics positions it to be ready for the IoT era.
Similar to Teradata Extreme Data Appliance 1650 (20)
How to Use Algorithms to Scale Digital Business (Teradata)
Gartner defines digital business as the creation of new business designs by blurring the digital and physical worlds. Digital business creates new business opportunities, but the amount of data generated will eclipse the human ability to process it. Further, many complex decisions will need to be made in timeframes, and at scales, that are impossible for human actors. Gartner analyst Chet Geschickter will share advice on how to leverage algorithmic business principles to drive digital business success.
Humans are sentient. We perceive. We feel. We listen. The problem is that the more of us you put together, the more we lose these capabilities. We get slower. The idea is: how do we create a company that acts like a single organism, where we identify opportunities, and that allows us to work in a faster, exponential world where development happens in months rather than years? Don't let digital transformation become a war of competitive attrition. You may need to invest in your future to change the game.
Teradata Listener™: Radically Simplify Big Data Streaming (Teradata)
Teradata Listener™ is an intelligent, self-service solution for ingesting and distributing extremely fast moving data streams throughout the analytical ecosystem. Listener is designed to be the primary ingestion framework for organizations with multiple data streams. Listener reliably delivers data without loss and provides low-latency ingestion for near real-time applications.
Telematics data provides a wealth of new, actionable insights, particularly when integrated with other enterprise data. But where do you start? How do you prioritize? What is the roadmap? In an interactive workshop learn how to derive more from data so you can do more in your business.
- Find the value of integrating telematics data with traditional data elements, including financial, customer, manufacturing, location and weather data
- How integrated telematics data can improve customer satisfaction, lifecycle management, warranty reserves, supply chain performance, and even engineering & design choices
- Gain practical examples from top manufacturers to improve operational efficiencies, develop new revenue streams, create customer insights, and better understand product performance
The Tools You Need to Build Relationships and Drive Revenue Checklist (Teradata)
This Campaign Manager Leadership series paper provides a checklist for marketers when considering blending offline data with online data to improve the customer experience.
Right Message, Right Time: The Secrets to Scaling Email Success (Teradata)
This Campaign Manager Leadership Series ebook outlines the 4 keys to an automated email marketing strategy and how marketers can scale to meet these “always-on” customer expectations.
BSI Teradata: The Shocking Case of Home Electronics Planet (Teradata)
Home Electronics Planet, a big-box retailer, has digital marketing campaigns that are failing. Their Chief Marketing Officer gets some analytics and data science help from Business Scenario Investigators who recommend changing their search keywords mix, creating tighter customer segments based on product purchase sequencing coupled with real-time web page personalizations, and revising their e-mail marketing to improve business results.
How we did it: BSI: Teradata Case of the Tainted Lasagna (Teradata)
Great Brands, a major food producer, faces yet another recall. The government is pointing at Turkey Broccoli Lasagna as the culprit, so the Chief Risk Officer and Chief Supply Chain Officer bring in BSI investigators to help them build a better, faster track-and-trace system using Big Data analytics.
To see more BSI: Teradata, go to http://www.facebook.com/bsiTeradata
Teradata BSI: Case of the Retail Turnaround (Teradata)
This set of PowerPoint slides describes the analytics work of Teradata Business Scenario Investigation employees who help move Taylor & Swift, a big-box retailer, from a siloed stores-vs.-web approach to an integrated omni-channel retailing approach to customers, marketing, and sales. The team comes up with 5 ideas, 2 of which are tried out. The story illustrates the use of Teradata, Aster, Aprimo, and Tableau as tools to glean faster and deeper analytical insights on Big Data, specifically web walks.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Teradata Extreme Data Appliance 1650
- Cabinet configurations: available in half-cabinet configurations
- Nodes: two MPP nodes per cabinet with Intel Westmere six-core Xeon processors
- Storage: (144) 1TB or 2TB SAS drives per cabinet
- Total user data capacity: 91.4TB per cabinet
- Scalability: scales to 180+ PB with 2TB drives
- Availability: RAID 1 disk mirroring, optional Fallback data protection and cliquing, BAR
- Operating system: SUSE Linux
- System management: single operational view across the complete system
- Interconnect: Teradata BYNET®