Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements brought to Hive by the Stinger initiative.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Hadoop Infrastructure @Uber: Past, Present and Future (DataWorks Summit)
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. Within Uber's data infrastructure, Hadoop is a central component. This talk covers the journey of Hadoop at Uber and future plans for scaling to support billions of trips: how the cluster grew from 10 to 2,000 nodes and how it will scale to tens of thousands of nodes. We will share our mistakes, lessons and wins, how we process billions of events per day, and the unique challenges and real-world use cases involved in co-locating Uber's service architecture with batch workloads (e.g. data pipelines, machine learning and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some problems in ways not attempted before. We hope this presentation serves as an example that encourages the audience to enhance the ecosystem themselves, growing the community around these projects and the big data space overall. The talk is for anyone working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes; it will also introduce some of the technologies the Uber team is building in the big data space.
Tez is the next-generation Hadoop query processing framework built on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be naturally expressed in the graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic, enabling the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
Real-Time Interactive Queries in Hadoop: Big Data Warehousing Meetup (Caserta)
During the Big Data Warehousing Meetup, we discussed options for enabling real-time/interactive queries to support business intelligence type functionality on Hadoop. Also, Hortonworks provided a deep-dive demo of Stinger! You can access that slideshow here: http://www.slideshare.net/CasertaConcepts/stinger-initiative-hortonworks
If you would like more information, please don't hesitate to contact us at info@casertaconcepts.com. Or, visit our website at http://casertaconcepts.com/.
HPE Hadoop Solutions - From use cases to proposal (DataWorks Summit)
Hadoop now does much more than storage and MapReduce, and it keeps improving and innovating. It brings near-real-time, interactive and cost-efficient capabilities to big data.
Join us to hear about solutions based on Hadoop: how they respond to specific customer needs, with which components from the Hadoop ecosystem, and based on which HPE Reference Architectures for the platform.
Hadoop solutions such as ETL offloading, predictive analytics, ad hoc query, complex event processing, stream processing, search, machine learning, deep learning, and more.
Based on software components such as Spark, Hive, HBase, Kafka, Storm, Flume, Impala and Elasticsearch.
Speaker
John Osborn, SA, Hewlett Packard Enterprise
Hadoop's capabilities offer untapped potential for business insights, but companies often get weighed down with DIY platforms and fail to keep up with the requirements. Join this Dell EMC session, which addresses this challenge with ready bundles that quickly deliver solutions for ETL offload, single view, and IoT.
Get more value from your big data:
• Deploy big data applications faster
• Increase business agility
• Confidently deliver high performance and endless scale
• Improve IT operational efficiency
Speaker
Shawn Smith, Big Data Specialist, Dell EMC
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes (DataWorks Summit)
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure that holds all the data inside an organization. This new use case presents a new set of challenges to the original HDFS architecture. One challenge is scaling HDFS storage management: the centralized scheme within the NameNode becomes the main bottleneck and limits the total number of files that can be stored. Although a typical large HDFS cluster can store several hundred petabytes of data, it handles large numbers of small files inefficiently under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to remove this limitation. Storage management is extended to a distributed scheme, and a new concept, the storage container, is introduced for storing objects. HDFS blocks are stored and managed as objects in storage containers instead of being tracked only by the NameNode. Storage containers are replicated across DataNodes using a newly developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture HDFS storage management scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Big Data Warehousing: Pig vs. Hive Comparison (Caserta)
In a recent Big Data Warehousing Meetup in NYC, Caserta Concepts partnered with Datameer to explore big data analytics techniques. In the presentation, we presented a Hive vs. Pig comparison. For more information on our services or this presentation, please visit www.casertaconcepts.com or contact us at info (at) casertaconcepts.com.
http://www.casertaconcepts.com
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse (DataWorks Summit)
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has largely been driven by business continuity planning or geo-localization requirements. It has also recently been gaining interest from a hybrid-cloud perspective, i.e. where people augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities: how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also cover what users need to be aware of to make replication optimal for their use case.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the open source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk gives an introduction to the Spark stack, explains how Spark achieves lightning-fast results, and shows how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) - Sudhir Mallem
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
using storage formats: Parquet, ORC, RCFile and Avro
Compression: Snappy, zlib and default compression (gzip)
Lisa McIntyre, Digital Asset Management Librarian at GSD&M, and Bob Canaway, CMO at Nuxeo, present on the new use cases for digital asset management in the enterprise.
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas (MongoDB)
In this talk, Appboy co-founder and CIO Jon Hyman will discuss various schemas that Appboy has evolved to use on MongoDB, remaining agile as Appboy has grown to massive scale. Jon will discuss topics such as random sampling of documents, multivariate testing and multi-armed bandit optimization of such tests, field tokenization, and how Appboy stores multi-dimensional data on an individual user basis to be able to quickly optimize for the best time to deliver messages to end users. Appboy is the global leader in Marketing Automation for Apps, helping clients such as Urban Outfitters, Shutterfly, Kixeye, PicsArt, USA Today Sports, and iHeartRadio increase engagement through automated messaging. Each month, Appboy collects tens of billions of data points from hundreds of millions of monthly active users.
Real-time, Sensor-based Monitoring of Shipping Containers (benaam)
This presentation describes a sensor-based solution for real-time monitoring of high-value assets in-transit so shippers can react quickly to unplanned events such as delays, cargo damage, and even thefts.
Selected as one of the best presentations at the 2012 Vail Computer Elements Workshop. For 42 years, this 4-day workshop has served leading architects of the computer industry. The agenda is 100% invited technical talks and the audience is mostly previous speakers.
Learn how to deconstruct what it means to be "Open," as well as how to engage developers, leverage users, and shape your data to make your platform ready for commercial use.
Presented April 14th, 2009, at BayCHI: http://www.baychi.org/calendar/20090414/
Hadoop & Greenplum: Why Do Such a Thing? (Ed Kohlwey)
Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.
This guide explains how to implement an Aruba 802.11n wireless network that must provide high-speed access to an auditorium-style room with 500 or more seats. Aruba Networks refers to such networks as high-density wireless LANs (HD WLANs). Lecture halls, hotel ballrooms, and convention centers are common examples of spaces with this requirement. Because the number of concurrent users on an AP is limited, serving such a large number of devices requires access point (AP) densities well in excess of the usual one AP per 2,500 – 5,000 ft2 (225 – 450 m2). Such coverage areas therefore present many special technical design challenges. This validated reference design provides the design principles, capacity planning methods, and physical installation knowledge needed to successfully deploy HD WLANs.
Best practice strategies to clean up and maintain your database with Hether G... (Blackbaud Pacific)
In this webinar Hether Ghelf, Blackbaud Pacific’s Senior Consultant & Project Manager, discusses a best practice approach to database cleaning and continued maintenance.
Cleansing your data can have an immediate impact on your business by increasing retention and response rates, decreasing the volume of mail returned from post, and ensuring mail is reaching your organisation’s constituents.
View the recording here: https://www.blackbaud.com.au/notforprofit-events/webinars/past
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform (Hortonworks)
Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at:
Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax , Xstream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution (Etu Solution)
Speaker: Senior Product Consultant, Informatica | 尹寒柏
Session overview: In the Big Data era, what counts is not the quantity of data but the depth at which you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once a mere buzzword, into a verb: moving from BI to CI, connecting with the pulse of the consumer economy and gaining insight into customer intent. One mindset matters in the Big Data era, though: in the end, the competition is not just about growing data volumes but about who understands the data more deeply. Informatica is the answer. With Informatica, we relieve the enormous pressure on enterprises to deliver trustworthy data in a timely manner; and as data volume and complexity keep rising, Informatica can also bring data together faster, making it meaningful and usable for improving efficiency, quality, certainty and competitive advantage. Informatica achieves this goal faster and more effectively, and is SYSTEX Group's (精誠集團) tool of choice for the Big Data era.
Data is the fuel for the idea economy, and being data-driven is essential for businesses to be competitive. HPE works with all the Hadoop partners to deliver packaged solutions that help you become data-driven. Join us in this session to hear about HPE's enterprise-grade Hadoop solution, which encompasses the following:
-Infrastructure – Two industrialized solutions optimized for Hadoop; a standard solution with co-located storage and compute and an elastic solution which lets you scale storage and compute independently to enable data sharing and prevent Hadoop cluster sprawl.
-Software – A choice of all popular Hadoop distributions, and Hadoop ecosystem components like Spark and more. And a comprehensive utility to manage your Hadoop cluster infrastructure.
-Services – HPE’s data center experts have designed some of the largest Hadoop clusters in the world and can help you design the right Hadoop infrastructure to avoid performance issues and future proof you against Hadoop cluster sprawl.
-Add-on solutions – Hadoop needs more to fill in the gaps. HPE partners with the right ecosystem partners to bring you solutions such as industrial-grade SQL on Hadoop with Vertica, data encryption with SecureData, the SAP ecosystem with SAP HANA Vora, multitenancy with BlueData, object storage with Scality, and more.
In 2017, more and more corporations are looking to reduce operational overheads in their enterprise data warehouse (EDW) installations. Hortonworks just launched the industry's first turnkey EDW optimization solution together with our partners Syncsort and AtScale. Join Hortonworks' CTO Scott Gnau to learn more about this exciting solution and its three use cases.
Use Cases from Batch to Streaming, MapReduce to Spark, Mainframe to Cloud: To... (Precisely)
So you built your Hadoop cluster. How do you get data from hundreds of database tables, streaming Kafka sources, and data shared by 20-year-old COBOL programs, all in there and working together quickly, efficiently and securely? With many customers asking this same question, Hortonworks recently expanded its partnership with Syncsort to provide optimized ETL onboarding for Hadoop. During this talk, we'll discuss how a next-generation ETL tool, built on contributions to the open source community and natively integrated in Hadoop, can drive lasting value for your organization. 1) Seamlessly onboard data from all your enterprise sources – batch and streaming -- into Hadoop for fast and easy analytics. 2) Stay agile and simplify your environment with a "design once, deploy anywhere" approach that minimizes disruption and risk in the face of a rapidly evolving big data ecosystem. 3) Secure, govern and manage your data with full integration with Apache Ambari, Apache Ranger, and more. These benefits come to life with real customer case studies. Learn how a national insurance company and global hotel chain are using Hortonworks HDP and Syncsort DMX-h to get bigger insights from their enterprise data, securely, efficiently, and cost-effectively, without spending hundreds of man-hours.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 (Rajit Saha)
At VMware Corporate IT, the Data Solution and Delivery Team has built an enterprise advanced data analytics platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2 and Alpine Data Labs.
Cisco Big Data Warehouse Expansion Featuring MapR Distribution (Appfluent Technology)
Learn more about the Cisco Big Data Warehouse Expansion Solution featuring MapR Distribution including Apache Hadoop.
The BDWE solution begins with the collection of data usage statistics by Appfluent. It then combines Cisco UCS hardware optimized for running the MapR Distribution including Hadoop, software for federating multiple data sources, and a comprehensive services methodology for assessing, migrating, virtualizing, and operating a logically expanded warehouse.
View the recording:
http://hortonworks.com/webinar/accelerating-real-time-data-ingest-hadoop/
Hadoop didn’t disrupt the data center. The exploding amounts of data did. But, let’s face it, if you can’t move your data to Hadoop, then you can’t use it in Hadoop. The experts from Hortonworks, the #1 leader in Hadoop development, and Attunity, a leading data management software provider, cover:
- How to ingest your most valuable data into Hadoop using Attunity Replicate
- About how customers are using Hortonworks DataFlow (HDF) powered by Apache NiFi
- How to combine the real-time change data capture (CDC) technology with connected data platforms from Hortonworks
We discuss how Attunity Replicate and Hortonworks Data Flow (HDF) work together to move data into Hadoop.
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC) - Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single-purpose data lakes
- How to build a logical multipurpose data lake for business users
- The newer use cases that make multipurpose data lakes a necessity
Gluent Extending Enterprise Applications with Hadoop (Gluent)
This presentation shows how to transparently extend enterprise applications with the power of modern data platforms such as Hadoop. Application re-writing is not needed and there is no downtime when virtualizing data with Gluent.
Similar to Hadoop from Hive with Stinger to Tez (20)
Slides for the Manchester Power BI User Group meeting on June 27th, 2019.
Subject: Power BI for Developers, covering Power BI Embedded and Power BI Custom Visuals
Presentation given as part of the Global Azure Bootcamp 2017, April 22, 2017. Subject: a one-day hands-on workshop about the Cortana Intelligence Suite.
Presentation given at SQL Saturday Denmark (#541), October 15, 2016 about the Power BI REST API and Power BI Embedded.
Demos are available at https://github.com/liprec/demos
2. 2
Introduction
Jan Pieter Posthuma
Microsoft Data Consultant
Rubicon, a local consultancy firm in the Netherlands
Architect role on multiple projects
Analysis Services, Reporting Services, Big Data, HDInsight, Cloud BI, Power BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jp.posthuma@rubicon.nl
4. 4
Hadoop
Hadoop is a collection of software for building a data-intensive distributed cluster running on commodity hardware:
‘store and process the data on the Internet in a simple, scalable and economically feasible way’
Widely accepted by database vendors as a solution for unstructured data
Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight (now on Windows and Linux)
Available on premises and as an Azure service
Hortonworks Data Platform (HDP) is 100% open source!
5. 5
Why SQL on Hadoop?
‘Hadoop is great for cost, but MapReduce is too difficult.’
‘SQL on Hadoop makes Hadoop real and gives me scale that traditional SQL can’t offer.’
‘I’m deleting important data because it’s too expensive to store it.’
6. 6
Hive
Originally developed at Facebook to address traditional RDBMS limitations.
300+ PB of data under management.
600+ TB of data loaded daily.
60,000+ Hive queries per day.
More than 1,000 users per day.
Initial Apache release in April 2009
Problem: Hive is bound to MapReduce, which leads to high latency; it needs higher performance
7. 7
Stinger
‘Making Apache Hive 100 Times Faster’
Hortonworks blog, February 2013
Vectorized SQL Engine + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X+
8. 8
ORCFiles
Started by Hortonworks to optimize the existing RCFile format, with input from Microsoft, to work together with the query engine and Tez
Two goals:
Improve query speed
Improve storage efficiency
CREATE TABLE … STORED AS ORC
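As a minimal sketch of that syntax in use (the table names here are hypothetical), an existing text-format table can be converted to ORC with a CREATE TABLE AS SELECT, after which queries read the smaller columnar copy:
-- assumes a plain-text source table named web_logs already exists
CREATE TABLE web_logs_orc STORED AS ORC AS SELECT * FROM web_logs;
-- queries now scan compressed, column-oriented stripes instead of raw text
SELECT COUNT(*) FROM web_logs_orc;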
11. 11
Stinger TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x Query Speedup, Maximum 160x Query Speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.
The cost-based optimizer added in Hive 0.14 gave an additional 2.5x speedup.
12. 12
Stinger.Next
Stinger.Next (in 3 phases)
Transactions with ACID semantics – allow users to easily modify data with inserts, updates and deletes, extending Hive from the traditional write-once, read-often system to support analytics over changing data.
Sub-second queries – allow users to deploy Hive for interactive dashboards and exploratory analytics with more demanding response-time requirements, with the emergence of LLAP (Live Long and Process) and Hive on Spark.
SQL:2011 analytics – allows rich reporting to be deployed on Hive faster, more simply and more reliably using standard SQL. A powerful cost-based optimizer ensures that complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.
13. 13
Apache Hive: Modern Architecture (legend: historical / current / in development)
Storage – columnar storage: ORCFile, Parquet; unstructured data: JSON, CSV, Text, Avro, custom (weblog)
Engine – SQL engines: row engine, vector engine
SQL – SQL support: SQL:2011 optimizer, HCatalog, HiveServer2
Cache – block cache, Linux cache, vector cache, LLAP (persistent server)
Distributed execution – Hadoop 1: MapReduce; Hadoop 2: Tez and Spark
15. 15
Links
Microsoft Big Data:
http://www.microsoft.com/bigdata
Hortonworks:
http://www.hortonworks.com
Try it yourself via Windows Azure HDInsight:
http://azure.com/hdinsight
Based on Google’s academic papers about distributed storage (2003) and large-scale data processing (2004), which inspired HDFS and MapReduce.
Hadoop started in 2005.
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
Baselined against Hive 0.10 and delivered over 18 months in three phases:
1. Introducing ORC files (Optimized Row Columnar)
2. Vectorized query engine
3. Hive on Tez
http://hortonworks.com/blog/100x-faster-hive/
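As a hedged illustration of how those three phases surface in a Hive session (the table name is hypothetical; the two settings are the standard Hive switches for the Tez engine and the vectorized engine):
set hive.execution.engine=tez;               -- phase 3: execute on Tez instead of MapReduce
set hive.vectorized.execution.enabled=true;  -- phase 2: process batches of ~1,024 rows at a time
-- phase 1: vectorization requires the table to be stored as ORC
SELECT state, COUNT(*) FROM customers_orc GROUP BY state;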
ORC file structure:
Default stripe size is 250 MB
File footer:
Stripe locations
Stripe row counts and the data types of each column
Statistics for each column (min, max, count and sum)
Postscript:
Compression parameters
Stripe index:
Min and max values for each column
Row index for positions in the file (by default every 10,000 rows)
Stripe footer:
Column stream locations (such as row data, nullable and dictionaries)
Column encodings
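These structural parameters can be tuned per table. A sketch, assuming a hypothetical table; orc.compress, orc.stripe.size and orc.row.index.stride are the standard ORC table properties corresponding to the pieces described above:
CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',         -- postscript: compression parameters
  'orc.stripe.size' = '268435456',   -- stripe size in bytes
  'orc.row.index.stride' = '10000'   -- row index entry every 10,000 rows
);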
YARN (Yet Another Resource Negotiator).
In Hadoop 1, MapReduce is both the data processor and the cluster resource manager, centrally managed via the JobTracker on the head node, which is inefficient.
YARN splits the JobTracker into a ResourceManager (on the head node) and per-node NodeManagers. Each node can communicate with other nodes and share status.
Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session
- Tez models data processing as a dataflow graph, with the graph vertices representing application logic and its edges representing movement of data.
- Tez models the user logic running in each vertex of the dataflow graph as a composition of Input, Processor and Output modules.
- YARN manages resources in a Hadoop cluster, based on cluster capacity and load.
In short: Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf.
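To make that concrete, a hedged sketch (table names hypothetical): hive.prewarm.* are the Hive settings for keeping hot Tez containers, and the join-plus-aggregation below compiles into a single Tez DAG instead of a chain of MapReduce jobs with intermediate HDFS writes:
set hive.execution.engine=tez;
set hive.prewarm.enabled=true;       -- start hot containers when the session opens
set hive.prewarm.numcontainers=10;   -- number of containers to keep ready
-- one multi-stage query, one Tez DAG
SELECT c.state, SUM(o.amount)
FROM orders o JOIN customers c ON o.customer_id = c.id
GROUP BY c.state;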
Stinger.Next started in the second half of 2014. Phase 1 has been delivered, so ACID transactions are now possible via a delta-file mechanism.
The next two phases are scheduled for release in 2015 (first and second half).
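As a minimal sketch of the phase 1 ACID capability (hypothetical table; the requirements shown – bucketed ORC storage, the transactional flag and the DbTxnManager – are the standard Hive 0.14 ACID prerequisites):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.txn.manager.DbTxnManager;
CREATE TABLE accounts (id INT, balance DECIMAL(10,2))
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
-- updates and deletes are written as delta files and merged on read / during compaction
UPDATE accounts SET balance = balance + 100 WHERE id = 42;
DELETE FROM accounts WHERE id = 7;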