This document provides an agenda and overview of a talk on big data and data science given by Peter Wang. The key points covered include:
- An honest perspective on big data trends and challenges over time.
- Architecting systems for data exploration and analysis using tools like Continuum Analytics' Blaze and Numba libraries.
- Python's role in data science for its ecosystem of libraries and accessibility to domain experts.
Wes McKinney gave a talk at the 2015 Open Data Science Conference about data frames and the state of data frame interfaces across different languages and libraries. He discussed the challenges of collaboration between different data frame communities due to the tight coupling of user interfaces, data representations, and computation engines in current data frame implementations. McKinney predicted that over time these components would decouple and specialize, improving code sharing across languages.
Big Data Architecture Workshop - Vahid Amiri (datastack)
This slide deck covers the big data tools, technologies, and layers that can be used in enterprise solutions.
TopHPC Conference, 2019
Apache Arrow -- Cross-language development platform for in-memory data - Wes McKinney
Wes McKinney is the creator of Python's pandas project and a primary developer of Apache Arrow, Apache Parquet, and other open-source projects. Apache Arrow is an open-source cross-language development platform for in-memory analytics that aims to improve data science tools. It provides a shared standard for memory interoperability and computation across languages through its columnar memory format and libraries. Apache Arrow has growing adoption in data science systems and is working to expand language support and computational capabilities.
This document discusses Apache Arrow, an open source project that aims to standardize in-memory data representations to enable efficient data sharing across systems. It summarizes Arrow's goals of improving performance by 10-100x on many workloads through a common data layer, reducing serialization overhead. The document outlines Arrow's language bindings for Java, C++, Python, R, and Julia and efforts to integrate Arrow with systems like Spark, Drill and Impala to enable faster analytics. It encourages involvement in the Apache Arrow community.
PyData: The Next Generation | Data Day Texas 2015 - Cloudera, Inc.
This document discusses the past, present, and future of Python for big data analytics. It provides background on the rise of Python as a data analysis tool through projects like NumPy, pandas, and scikit-learn. However, as big data systems like Hadoop became popular, Python was not initially well-suited for problems at that scale. Recent projects like PySpark, Blaze, and Spartan aim to bring Python to big data, but challenges remain around data formats, distributed computing interfaces, and competing with Scala. The document calls for continued investment in high performance Python tools for big data to ensure its relevance in coming years.
This presentation covers "Introduction to Big Data" for enterprises. It includes the challenges and benefits of big data, along with a transition plan based on a few case studies.
Data Science Languages and Industry Analytics - Wes McKinney
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
Introduction to Big Data Technologies & Applications - Nguyen Cao
Big Data Myths, Current Mainstream Technologies related to Collecting, Storing, Computing & Stream Processing Data. Real-life experience with E-commerce businesses.
An Incomplete Data Tools Landscape for Hackers in 2015 - Wes McKinney
Wes McKinney gives an overview of the current data analysis tools landscape in Python and R. He discusses essential Python packages like NumPy, pandas, and scikit-learn. For R, he covers packages in the "Hadley stack" like dplyr and ggplot2. IPython/Jupyter notebooks are also mentioned as a platform for interactive data analysis across languages. The talk aims to highlight trends, opportunities, and challenges in the open source data science tool ecosystem.
Memory Interoperability in Analytics and Machine Learning - Wes McKinney
Wes McKinney gave a talk on Apache Arrow, an open source project for memory interoperability between analytics and machine learning systems. Arrow provides efficient columnar memory structures and zero-copy sharing of data between applications. It defines common data types and schemas that can be used across programming languages. Arrow is implemented in C++ and provides language bindings for other languages like Python. It aims to improve performance for tasks like data loading, preprocessing, modeling and serving. Projects like pandas, Spark and Ray are exploring using Arrow internally for more efficient data handling.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Data ingestion using NiFi - Quick Overview - Durga Gadiraju
* Overview of NiFi
* Understanding NiFi Layout as a service
* Key Concepts such as Flow Files, Attributes etc
* Understanding how to access the documentation
* Capabilities of NiFi as a Data Ingestion Tool
* NiFi vs. Traditional ETL Tools
* Role of NiFi in Data Engineering at Scale
* Simple pipeline to copy files from Local File System and HDFS
The document discusses the evolution of big data architectures from Hadoop and MapReduce to Lambda architecture and stream processing frameworks. It notes the limitations of early frameworks in terms of latency, scalability, and fault tolerance. Modern architectures aim to unify batch and stream processing for low latency queries over both historical and new data.
My Data Journey with Python (SciPy 2015 Keynote) - Wes McKinney
Wes McKinney gave a keynote talk at SciPy 2015 about his journey with Python for data analysis from 2007 to present day. He started as a mathematician with no exposure to Python or data analysis tools. His first job was at a quant hedge fund where he encountered frustrations with productivity due to extensive use of SQL and Excel. In 2008, he began experimenting with Python and created early versions of pandas to improve productivity for his projects. This led to open sourcing pandas in 2009 and evangelizing Python more broadly within his company and community.
This document summarizes key differences between SQL and NoSQL databases. It discusses the history and advantages of SQL, including its ubiquitous use, standardization, and ability to optimize queries based on schema. However, SQL has limitations for sparse and complex data. The document then introduces NoSQL databases and how they are designed for scale-out. It uses HBase as an example to illustrate how NoSQL databases can solve SQL's distributed join challenges by co-locating related data without schemas. While this improves performance, it also has disadvantages for modeling relationships and exploring data.
Wes McKinney gave a talk at Data Day Texas 2015 about the past, present, and future of the Python data analysis community and ecosystem, known as PyData. He discussed how Python became a popular language for data analysis due to tools like NumPy, pandas, scikit-learn, and IPython that enabled interactive exploration and modeling of data. However, Python was not initially well-suited for large-scale "big data" problems involving Hadoop and Spark. Recent developments in PySpark and improved integration of Python with big data frameworks have helped address this, but challenges remain in data structures, emerging formats, and open-sourcing of Python big data solutions.
Apache Arrow: Cross-language Development Platform for In-memory Data - Wes McKinney
Apache Arrow is an open standard for in-memory columnar data and an analytical data processing platform. It aims to simplify system architectures, improve interoperability between systems, and enable data and algorithms to be reused across different programming languages. Arrow provides a portable in-memory data format and computational libraries to build analytical data processing systems. It is language-independent and supports data sharing and algorithm reuse between libraries and processes via shared memory with near-zero overhead.
Ibis: Scaling the Python Data Experience - Wes McKinney
Ibis is a new open source project that allows Python data scientists to analyze large datasets using the same Python code and tools they use for smaller datasets. Ibis provides a high-level Python API for describing analytics and ETL processes that can be executed by Impala for scalability. The beta release of Ibis aims to maximize productivity for data engineers and scientists by enabling them to solve big data problems without leaving the familiar Python environment. Future roadmap items include better support for complex data types and machine learning as well as improved integration with the Python data science ecosystem.
Python Data Ecosystem: Thoughts on Building for the Future - Wes McKinney
Wes McKinney gives a presentation on the Python data ecosystem and building open source communities. He discusses his background working on Python data tools like pandas and Apache projects. McKinney emphasizes the importance of transparency, consensus building, and valuing all contributions when developing open source software. He also examines challenges in Python packaging and sees opportunities in building bridges between data science languages and tools for analyzing new data types and storage technologies.
The document discusses big data and Hadoop. It describes the three V's of big data - variety, volume, and velocity. It also discusses Hadoop components like HDFS, MapReduce, Pig, Hive, and YARN. Hadoop is a framework for storing and processing large datasets in a distributed computing environment. It allows for the ability to store and use all types of data at scale using commodity hardware.
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit... - Databricks
We’ve all heard that AI is going to become as ubiquitous in the enterprise as the telephone, but what does that mean exactly?
Everyone in IBM has a telephone; and everyone knows how to use her telephone; and yet IBM isn’t a phone company. How do we bring AI to the same standard of ubiquity — where everyone in a company has access to AI and knows how to use AI; and yet the company is not an AI company?
In this talk, we’ll break down the challenges a domain expert faces today in applying AI to real-world problems. We’ll talk about the challenges that a domain expert needs to overcome in order to go from “I know a model of this type exists” to “I can tell an application developer how to apply this model to my domain.”
We’ll conclude the talk with a live demo that show cases how a domain expert can cut through the five stages of model deployment in minutes instead of days using IBM and other open source tools.
Apache Arrow at DataEngConf Barcelona 2018 - Wes McKinney
Wes McKinney is a leading open source developer who created Python's pandas library and now leads the Apache Arrow project. Apache Arrow is an open standard for in-memory analytics that aims to improve data sharing and reuse across systems by defining a common columnar data format and memory layout. It allows data to be accessed and algorithms to be reused across different programming languages with near-zero data copying. Arrow is being integrated into various data systems and is working to expand its computational libraries and language support.
HUG_Ireland_Apache_Arrow_Tomer_Shiran - John Mulhall
A presentation by Tomer Shiran, CEO of Dremio made to Hadoop User Group (HUG) Ireland on "Hadoop Summit Night" on April 12th, 2016. This presentation covers Apache Arrow in detail.
Jethro data meetup: index-based SQL on Hadoop - Oct 2014 - Eli Singer
JethroData is an index-based SQL-on-Hadoop engine.
Architecture comparison of MPP / full-scan SQL engines such as Impala and Hive with index-based access such as Jethro.
SQL and NoSQL NYC meetup, Oct 20, 2014
Boaz Raufman
Apache Arrow: Leveling Up the Analytics Stack - Wes McKinney
This document discusses the development of Apache Arrow, an open source in-memory data format designed for efficient analytical data processing on modern hardware. It provides a brief history of big data and analytics technologies leading to the need for Arrow. Key points about Arrow include that it aims to eliminate data serialization, enable code sharing across languages, and has over 400 contributors representing 11 programming languages. Notable subcomponents include DataFusion, Gandiva, and Plasma; and development is supported by organizations like Ursa Labs.
HBase and Drill: How loosely typed SQL is ideal for NoSQL - DataWorks Summit
The document discusses how complex data structures can be modeled in a database using an extended relational model. It begins with an agenda that includes discussing loose typing, examples of what can be done, and looking at a real database with 10-20x fewer tables. It then contrasts the traditional relational model with HBase and discusses how structuring allows complex objects in fields and references between objects. Examples are given of modeling time-series data and music metadata in fewer tables using these techniques. Apache Drill is presented as a way to perform SQL queries over these complex data structures.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
This document discusses data engineering. It defines data engineering as software engineering focused on dealing with large amounts of data. It explains why data engineering has become important now due to advances in technology and economics. The document then discusses data engineering concepts like distributed systems, parallel processing, and databases. It provides an example of a data pipeline that collects tweets and processes them. Finally, it discusses qualities of an ideal data engineer.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent - Seeling Cheung
Nicholas Berg presented on Seagate's use of big data analytics to manage the large amount of manufacturing data generated from its hard drive production. Seagate collects terabytes of data per day from testing its drives, which it analyzes using Hadoop to improve quality, predict failures, and gain other insights. It faces challenges in integrating this emerging platform due to the rapid evolution of Hadoop and lack of tools to fully leverage large datasets. Seagate is developing its data lake and data science capabilities on Hadoop to better optimize manufacturing and drive design.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
The webinar discusses how organizations can make big data easy to use with the right tools and talent. It presents on MetaScale's expertise in helping Sears Holdings implement Hadoop and how Kognitio's in-memory analytics platform can accelerate Hadoop for organizations. The webinar agenda includes an introduction, a case study on Sears Holdings' Hadoop implementation, an explanation of how Kognitio's platform accelerates Hadoop, and a Q&A session.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
20160331 sa introduction to big data pipelining berlin meetup 0.3 - Simon Ambridge
This document discusses building data pipelines with Apache Spark and DataStax Enterprise (DSE) for both static and real-time data. It describes how DSE provides a scalable, fault-tolerant platform for distributed data storage with Cassandra and real-time analytics with Spark. It also discusses using Kafka as a messaging queue for streaming data and processing it with Spark. The document provides examples of using notebooks, Parquet, and Akka for building pipelines to handle both large static datasets and fast, real-time streaming data sources.
NoSQL and SQL databases can work together to handle real-time big data needs. Apache Drill is an open source tool that allows interactive analysis of big data using standard SQL queries across NoSQL, Hadoop, and relational data sources. It provides low-latency queries, full ANSI SQL support, and flexibility to handle rapidly evolving schemas and data in different systems. By enabling analysis of all data together using a common interface, it helps tackle challenges of combining operational and decision support systems on big, diverse datasets.
Apache Cassandra training. Overview and Basics - Oleg Magazov
This document provides an overview of Apache Cassandra, including:
- Its history originating from Facebook's need to solve an inbox search problem.
- Its key features like high availability, linear scalability, fault tolerance and tunable consistency.
- Its architecture based on consistent hashing and a ring topology for data distribution.
- Its data model using keyspaces, column families, rows, and columns differently than a relational database.
- Examples of using the Cassandra CLI to create a schema, insert data, and perform queries.
An overview of modern scalable web development - Tung Nguyen
The document provides an overview of modern scalable web development trends. It discusses the motivation to build systems that can handle large amounts of data quickly and reliably. It then summarizes the evolution of software architectures from monolithic to microservices. Specific techniques covered include reactive system design, big data analytics using Hadoop and MapReduce, machine learning workflows, and cloud computing services. The document concludes with an overview of the technologies used in the Septeni tech stack.
Apache CarbonData+Spark to realize data convergence and Unified high performa... - Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally; moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database - Kinetica
Freed from the constraints of storage, network and memory, many big data analytics systems are now routinely revealing themselves to be compute bound. To compensate, big data analytic systems often result in wide horizontal sprawl (300-node Spark or NoSQL clusters are not unusual!) to bring in enough compute for the task at hand. High system complexity and crushing operational costs often result. As the world shifts from physical to virtual assets and methods of engagement, there is an increasing need for systems of intelligence to live alongside the more traditional systems of record and systems of analysis. New approaches to data processing are required to support the real-time processing of data required to drive these systems of intelligence.
Join 451 Research and Kinetica to learn:
•An overview of the business and technical trends driving widespread interest in real-time analytics
•Why systems of analysis need to be transformed and augmented with systems of intelligence bringing new approaches to data processing
•How a new class of solution—a GPU-accelerated, scale out, in-memory database–can bring you orders of magnitude more compute power, significantly smaller hardware footprint, and unrivaled analytic capabilities.
•Hear how other companies in a variety of industries, such as financial services, entertainment, pharmaceutical, and oil and gas, benefit from augmenting their legacy systems with a modern analytics database.
ACM TechTalks: Apache Arrow and the Future of Data Frames - Wes McKinney
Wes McKinney gave a talk on Apache Arrow and the future of data frames. He discussed how Arrow aims to standardize columnar data formats and reduce inefficiencies in data processing. It defines an efficient binary format for transferring data between systems and programming languages. As more tools support Arrow natively, it will become more efficient to process data directly in Arrow format rather than converting between data structures. Arrow is gaining adoption in popular data tools like Spark, BigQuery, and InfluxDB to improve performance.
The document discusses the role and responsibilities of a data architect. It provides information on the high demand and salaries for data architects, which can be over $200,000 at companies like Microsoft. The summary also outlines some of the key technical skills required for the role, including strong data modeling abilities, knowledge of databases, ETL tools, analytics dashboards, and programming languages like SQL, Python and R. Business skills like communication and presenting complex concepts are also important.
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for... - HPCC Systems
Presenter: John Andleman, Staff Database Engineer, Citrix
In this session, John will share some interesting use cases leveraging the HPCC Systems platform, including those beyond traditional big data uses. John will also share his roadmap of HPCC projects being planned for the next few months and why he feels HPCC Systems is a more suitable solution than Hadoop based on experiences and lessons learned.
NOTE: The video of this presentation is the 3rd one shown in the accompanying YouTube link.
2. Agenda
• Big Data - An honest perspective
• Architecting for Data
• Continuum’s Tools
• DARPA & Data Science
3. About Peter
• Co-founder & President at Continuum
• Author of several Python libraries & tools
• Scientific, financial, engineering HPC using Python, C, C++, etc.
• Interactive visualization
• Organizer of Austin Python
• Background in Physics (BA Cornell ’99)
8. The Players
• Data processing & low-level infrastructure
• Traditional BI vendors
• New BI startups
• Data-oriented startups
• Analytics-as-a-service
• “Big data” infrastructure platforms (DB & analytical compute as a service)
10. Observed Trends
• Diversification away from SQL & relational DBs
• “Messy” data, agile data processing
• Dynamic schema management
• Acknowledgement of heterogeneous data environments
• Focus on high performance
  • Richer simulations, processing more data
  • Modern hardware revolution (SSDs, GPUs, etc.)
• Advanced visualization
  • Interactive, novel plots
  • Beyond simple reports and dashboards
• Advanced analytics
  • Richer statistical models, Bayesian approaches
  • Machine learning
  • Predictive databases
11. Data Revolution
“Internet Revolution” True Believer, 1996: Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition.
“Data Revolution” True Believer, 2010: Businesses that build data comprehension into their core will destroy their competition over the next 5-10 years.
12. Opportunities
• Advanced ML & predictive DBs will provide transformative insights to nearly every business.
• Mobile & high-speed connectivity mean more dimensions of customer life are being digitized.
• Every bit of new data makes old data more valuable; analyzing historical data becomes more important.
• Developing internal data analysis capability means you can more easily build data products to sell downstream. This is becoming an industry unto itself.
13. Technical Challenges
• Hardware & software do not yet make data analysis easy at terabyte scales.
• Current analytics are mostly I/O bound. Next-gen “advanced” analytics will be compute bound (simulations, distributed LinAlg). Efficiency matters.
• Reproducible analytical environments
• Library & language choices can add “air gaps” between domain experts and analytical infrastructure.
14. Business Challenges
• Data exploration is a new discipline for most businesses.
• Balancing agility & process for data-oriented processes and analytical libraries.
• Bad data architecture will generally not cause catastrophic failures; instead, it will erode your ability to compete. It’s hard to know when you are sucking.
15. Data Matters
• Data has mass.
• Scalability requires minimizing data movement (moving it only as necessary).
• Deep/advanced analytics needs the full computing stack to be as accessible as SQL and Excel.
• Data should only move when it has to (to communicate results, to replicate, to back up), not because the technology doesn’t allow access.
16. Algorithms Matter
“...a Mac Mini running GraphChi can analyze Twitter’s social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes. ‘The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers,’ Guestrin says.” -- MIT Tech Review
“...Spark, running on a cluster of 50 machines (100 CPUs) runs five iterations of Pagerank on the twitter-2010 in 486.6 seconds. GraphChi solves the same problem in less than double of the time (790 seconds), with only 2 CPUs.”
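Both quotes benchmark PageRank at scale. As a point of reference only — this is the textbook power-iteration form of the computation in NumPy on a toy graph, not GraphChi's or Spark's actual implementation:

import numpy as np

def pagerank(adj, d=0.85, iters=5):
    # Dense power iteration: fine for a toy graph, hopeless for 1.2B edges.
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    M = (adj / np.maximum(out_deg, 1.0)).T       # column-stochastic transitions
    r = np.full(n, 1.0 / n)                      # start from a uniform rank vector
    for _ in range(iters):
        r = (1.0 - d) / n + d * M.dot(r)         # damped rank update
    return r

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
print(pagerank(adj))                             # one importance score per node

The systems in the quotes differ in how they schedule exactly this update over graphs that do not fit in memory, which is the slide's point: the algorithmic formulation, not the cluster size, dominates.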
18. Memory Matters
[Figure: Evolution of the hierarchical memory model, with panels labeled 1980s, 90s-00s, and 2010s. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what's coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks. Axes: speed versus capacity, spanning Level 1/2/3 caches, main memory, solid state disk, and mechanical disk.]
Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible (for an example of a programming model built around the memory hierarchy, see the Sequoia project at www.stanford.edu/group/sequoia).
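A minimal sketch of why this hierarchy matters in everyday analytics code, assuming only NumPy: summing a row-major array in storage order streams through contiguous cache lines, while walking it column-by-column pays a full row-length stride per element.

import time
import numpy as np

a = np.random.rand(5000, 5000)        # C (row-major) order: rows are contiguous

def sum_rows_first(x):
    # Walks memory in storage order; each row is one contiguous block.
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i, :].sum()
    return s

def sum_cols_first(x):
    # Jumps a 40 KB stride between consecutive elements of each column.
    s = 0.0
    for j in range(x.shape[1]):
        s += x[:, j].sum()
    return s

for f in (sum_rows_first, sum_cols_first):
    t0 = time.time()
    f(a)
    print(f.__name__, "%.2fs" % (time.time() - t0))

Same arithmetic, same data; the only difference is how the access pattern interacts with the cache levels in the figure.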
20. Jeff Hammerbacher’s Advice
• Instrument everything
• Put all your data in one place
• Data first, questions later
• Store first, structure later (often the data model is dependent on the analysis you'd like to perform)
• Keep raw data forever
• Let everyone party on the data
• Introduce tools to support the whole research cycle (think of the scope of the product as the entire cycle, not just the container)
• Modular and composable infrastructure
21. Architecting for Data
Data exploration as the central task.
Data visualization as a first-class citizen.
Enable agility.
24. Why Python?
• Easy for domain experts to learn
• Powerful enough for software devs to build backend infrastructure
• Mature and broad ecosystem of libraries enables rich applications and scripts (over 28,500 packages on PyPI)
• Very large community of users
• Syntax matters
25. Key Strengths
Rich enough syntax & features to do
powerful, high-level things;
easily extensible via C/C++/Fortran to
optimize low-level things;
connects to existing infrastructure with
extremely large, capable third-party library
support.
26. Python in Big Data
• Machine learning, statistical processing
• Text analytics
• Graph analysis
• Integration with Hadoop + additional map-
reduce and distributed data paradigms
• Over a decade of use in scientific computing
• Very popular among data scientists and
“regular” scientists
27. PyData
• Python for data analysis, big data, BI
• Uses much of SciPy, but adds libraries for
machine learning & advanced analytics
• Developer & user community
• Next conferences:
• Boston, July 27-28, 2013
• NYC & London, October/November 2013
• Looking for sponsors, local chapters, etc.
28. Python in Enterprise
• “Up & coming” technology; advocates are generally
early adopters and thought leaders
• Classic languages like Java, C# are safe bets;
frequently used for low-risk projects
• Python & others are used for innovation or
disruptive “skunk works” projects
• Potential impedance mismatch with some
organizations & dev groups
• No silver bullets
29. Domains
• Finance
• Geophysics
• Defense
• Advertising metrics & data analysis
• Scientific computing
Technologies
• Array/Columnar data processing
• Distributed computing, HPC
• GPU and new vector hardware
• Machine learning, predictive analytics
• Interactive visualization
[Diagram: Continuum Analytics at the intersection of Enterprise
Python, Scientific Computing, Data Processing, Data Analysis,
Visualisation, and Scalable Computing]
30. Mission
To revolutionize analysis and visualization by moving
high-level code and domain expertise to data:
• Out-of-core, distributed data computation
• Interactive visualization of massive datasets
• Advanced, powerful analytics, accessible to
domain experts and business users via a
simplified programming model
• Collaborative, shareable analysis
31. Big Picture
Empower domain experts with
high-level tools that exploit modern
hardware
Array Oriented Computing
32. Projects
Blaze: High-performance Python library for modern
vector computing, distributed and streaming data
Numba: Vectorizing Python compiler for multicore
and GPU, using LLVM
Bokeh: Interactive, grammar-based visualization
system for large datasets
Common theme: High-level, expressive language for
domain experts; innovative compilers & runtimes
for efficient, powerful data transformation
33. Objectives - Blaze
• Flexible descriptor for tabular and semi-structured data
• Seamless handling of:
• On-disk / Out of core
• Streaming data
• Distributed data
• Uniform treatment of:
• “arrays of structures” and
“structures of arrays”
• missing values
• “ragged” shapes
• categorical types
• computed columns
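The “arrays of structures” vs. “structures of arrays” distinction can be
made concrete with plain NumPy; this is an illustrative sketch, not
Blaze’s own API.

import numpy as np

# "Array of structures": one record type, records stored contiguously.
aos = np.zeros(3, dtype=[('id', 'i8'), ('amount', 'f8')])
aos['id'] = [1, 2, 3]
aos['amount'] = [10.0, 20.5, 7.25]

# "Structure of arrays": one contiguous array per field.
soa = {'id': np.array([1, 2, 3], dtype='i8'),
       'amount': np.array([10.0, 20.5, 7.25])}

# Same logical table, two memory layouts; uniform treatment means
# field-level code works the same way on either:
print(aos['amount'].sum(), soa['amount'].sum())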
34. Blaze Status
• DataShape type grammar
• NumPy-compatible C++ calculation engine (DyND)
• Synthesis of array function kernels (via LLVM)
• Fast timeseries routines (dynamic time warping for
pattern matching)
• Array Server prototype
• BLZ columnar storage format
• 0.1 released at the beginning of summer; 0.2 in progress
37. Kiva: Array Server
DataShape + Raw JSON = Web Service
type KivaLoan = {
id: int64;
name: string;
description: {
languages: var, string(2);
texts: json # map<string(2), string>;
};
status: string; # LoanStatusType;
funded_amount: float64;
basket_amount: json; # Option(float64);
paid_amount: json; # Option(float64);
image: {
id: int64;
template_id: int64;
};
video: json;
activity: string;
sector: string;
use: string;
delinquent: bool;
location: {
country_code: string(2);
country: string;
town: json; # Option(string);
geo: {
level: string; # GeoLevelType
pairs: string; # latlong
type: string; # GeoTypeType
}
};
....
{"id":200533,"name":"Miawand Group","description":{"languages":
["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the
16th district of Kabul, Afghanistan. He lives in a family of eight members. He
is single, but is a responsible boy who works hard and supports the whole
family. He is a carpenter and is busy working in his shop seven days a week.
He needs the loan to purchase wood and needed carpentry tools such as tape
measures, rulers and so on.rn rnHe hopes to make progress through the
loan and he is confident that will make his repayments on time and will join
for another loan cycle as well. rnrn"}},"status":"paid","funded_amount":
925,"basket_amount":null,"paid_amount":925,"image":{"id":
539726,"template_id":
1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants
to buy tools for his carpentry shop","delinquent":null,"location":
{"country_code":"AF","country":"Afghanistan","town":"Kabul
Afghanistan","geo":{"level":"country","pairs":"33
65","type":"point"}},"partner_id":
34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loa
n_amount":925,"currency_exchange_loss_amount":null,"borrowers":
[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},
{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},
{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms":
{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbur
sal_amount":42000,"loan_amount":925,"local_payments":
[{"due_date":"2010-06-13T07:00:00Z","amount":4200},
{"due_date":"2010-07-13T07:00:00Z","amount":4200},
{"due_date":"2010-08-13T07:00:00Z","amount":4200},
{"due_date":"2010-09-13T07:00:00Z","amount":4200},
{"due_date":"2010-10-13T07:00:00Z","amount":4200},
{"due_date":"2010-11-13T08:00:00Z","amount":4200},
{"due_date":"2010-12-13T08:00:00Z","amount":4200},
{"due_date":"2011-01-13T08:00:00Z","amount":4200},
{"due_date":"2011-02-13T08:00:00Z","amount":4200},
{"due_date":"2011-03-13T08:00:00Z","amount":
4200}],"scheduled_payments": ...
2.9 GB of JSON => network-queryable array: ~5 minutes
http://192.34.58.57:8080/kiva/loans
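For a sense of what that replaces, here is a hedged sketch (record
trimmed from the sample above) of the per-record work needed when the
data is just raw JSON text:

import json

# A trimmed loan record in the shape of the Kiva sample above.
raw_line = '''{"id": 200533, "name": "Miawand Group", "status": "paid",
 "funded_amount": 925,
 "terms": {"local_payments": [
   {"due_date": "2010-06-13T07:00:00Z", "amount": 4200},
   {"due_date": "2010-07-13T07:00:00Z", "amount": 4200}]}}'''

record = json.loads(raw_line)
# Per-record parsing and traversal that a schema-aware array
# server does once, up front, for the whole dataset:
total = sum(p["amount"] for p in record["terms"]["local_payments"])
print(record["name"], record["status"], total)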
38. Akamai Dataset ETL
                   Hive                    Python script
Hardware           8x 16-core, 2 GHz       1x 8-core, 2.2 GHz
                   (128 cores)
Memory             RAM: 8x 382 GB          RAM: 144 GB
                   HDD: 8x 15k rpm         HDD: 2x 7200 rpm
Time (traceroute)  5 hrs, 635M routes      11 hrs, 113M routes
Routes/hr/GHz      496k                    584k
• Python performs ~18% better with almost no optimization
• resulting IPMap can be used for realtime, online query and
aggregation
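A hypothetical sketch of the single-node pattern behind numbers like
these (field layout invented for illustration): stream the log once,
aggregate as you go, and never hold the full dataset in RAM.

import collections

def build_ip_map(path):
    """Stream a traceroute log once, counting routes per source IP.
    One pass, O(distinct IPs) memory: the full dataset is never
    materialized in RAM."""
    counts = collections.Counter()
    with open(path) as f:
        for line in f:                 # one traceroute record per line
            src_ip = line.split()[0]   # hypothetical: first field is source IP
            counts[src_ip] += 1
    return counts

# ip_map = build_ip_map("traceroutes.log")   # streams hundreds of millions of records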
39. Querying Traceroute in BLZ format
                  1k Random             Full Scan
                  Time      RAM         Time      RAM
BLZ (disk)        3.5 s     0.04 MB     2.9 s     8 MB
BLZ (mem)         2.37 s    210 MB      2.4 s     210 MB
NPY (memmap)      0.24 s    0.2 MB      0.23 s    602 MB
NumPy (mem)       0.13 s    603 MB      0.23 s    603 MB
BLZ is meant for dealing with Big Data
(RAM consumption stays extremely low for disk-backed queries)
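The NPY (memmap) row corresponds to a pattern like the following
(filename invented for illustration); np.load with mmap_mode leaves the
array on disk and pages in only what a query touches:

import numpy as np

# Memory-map a saved array: nothing is read until it is indexed.
routes = np.load("traceroutes.npy", mmap_mode="r")

# 1k random-row query: only the touched pages are faulted in,
# which is why RAM stays near zero in the table above.
idx = np.random.randint(0, len(routes), size=1000)
sample = routes[idx]

# A full scan forces every page through memory once.
total = routes.sum()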
40. Numba
• Just-in-time, dynamic compiler for Python
• Optimize data-parallel computations at call time,
to take advantage of local hardware configuration
• Compatible with NumPy, Blaze
• Leverage LLVM ecosystem:
• Optimization passes
• Inter-op with other languages
• Variety of backends (e.g. CUDA for GPU support)
44. Image Processing
from numba import jit

@jit('void(f8[:,:], f8[:,:], f8[:,:])')
def filter(image, filt, output):
    # 2-D correlation: slide `filt` over the interior of `image`,
    # writing each weighted sum into `output`.
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result
~1500x speed-up
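A minimal usage sketch (arrays invented for illustration):

import numpy as np

image = np.random.rand(512, 512)
filt = np.ones((5, 5)) / 25.0      # simple 5x5 box-blur kernel
output = np.zeros_like(image)
filter(image, filt, output)        # runs the Numba-compiled kernel in place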
45. Example: Mandelbrot Vectorized
from numbapro import vectorize

sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    # Each GPU thread computes the escape time of one pixel.
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    x = tid % width
    y = tid // width               # integer division: pixel row
    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y
    c = complex(real, imag)
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255
Kind      Time (s)    Speed-up
Python    263.6       1.0x
CPU       2.639       100x
GPU       0.1676      1573x
(GPU: Tesla S2050)
46. Bokeh
• Language-based (instead of GUI) visualization system
• High-level expressions of data binding, statistical transforms,
interactivity and linked data
• Easy to learn, but expressive depth for power users
• Interactive
• Data space configuration as well as data selection
• Specified from high-level language constructs
• Web as first class interface target
• Support for large datasets via intelligent downsampling
(“abstract rendering”)
47. Bokeh
Inspirations:
• Chaco: interactive, viz pipeline for large data
• Protovis & Stencil: binding visual
glyphs to data and expressions
• ggplot2: faceting, statistical overlays
Design goal:
Accessible, extensible, interactive plotting for the web...
... for non-Javascript programmers
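As a hedged illustration of the “for non-Javascript programmers” point,
here is a small plot using today’s bokeh.plotting API (which has
evolved since this talk):

import numpy as np
from bokeh.plotting import figure, show

x = np.linspace(0, 10, 200)
y = np.sin(x)

# Declarative glyph binding: data mapped to visual properties in Python,
# rendered interactively in the browser with no hand-written JavaScript.
p = figure(title="sin(x)", tools="pan,wheel_zoom,box_select,reset")
p.line(x, y, line_width=2)
show(p)   # opens an interactive HTML plot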
51. Kiva: Abstract Rendering
Basic AR can identify trouble spots in standard plots, and also
offer automatic tone mapping, taking perception into account.
37 million elements, showing adjacency between entities in the Kiva dataset
55. http://Wakari.io
• Cloud-hosted Python analytics environment
• Full Linux sandbox for every user
• IPython notebook
• Interactive Javascript plotting
• Easy to share notebooks & code with other users
• Free plan: 512 MB memory, 10 GB disk
• Premium plans include: More powerful machines,
more memory/disk, SSH access, cluster support
58. White House Big Data Initiative
• $200 million for NIH, NSF,
DOE, DOD, USGS
• DoD investing $60 million
annually in new programs
• $25 mil for XDATA
59. DARPA XDATA (BAA-12-38)
“A large and critical part of DoD data can be characterized as semi-
structured, heterogeneous, and scientifically collected data with
varied amounts of completeness and standardization. Therefore a
one-size-fits-all end-to-end system is unlikely to meet all
analytical goals...”
“DoD collected data are particularly difficult to deal with, including
missing data, missing connections between data, incomplete data,
corrupted data, data of variable size and type, etc.”
60. XDATA Needs
“MapReduce ... results in selection bias for certain types of
problems, which may prevent ... a comprehensive understanding
of the data.”
• Develop analytical principles which scale across data volume and
distributed architecture
• Minimize design-to-execution time
• Leverage problem structure to create new algorithms that
trade off time/space/stream complexity
• Distributed sampling & estimation techniques (sketched
after this list)
• Distributed dimensionality reduction, matrix factorization
• Determining optimal cloud configuration & resource allocation
with asymmetric components (GPU, big-mem nodes, etc.)
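One concrete instance of the sampling techniques called for above,
sketched here (not part of the BAA): reservoir sampling keeps a uniform
fixed-size sample of a stream of unknown length, and per-node reservoirs
can be combined, which is what makes it usable on distributed data.

import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items from a stream of unknown length.
    O(k) memory regardless of stream size."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, n)      # inclusive bounds
            if j < k:
                reservoir[j] = item    # replace with decreasing probability
    return reservoir

# Each node samples its own partition; the per-node samples are then
# combined (weighted by partition size) for a cluster-wide estimate.
sample = reservoir_sample(range(10**6), k=100)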