DATA EXPERTS 
We accelerate research and transform data to help you create actionable insights 
WE MINE
WE ANALYZE
WE VISUALIZE
Data Scientists 
Software Engineers 
Domain Analysts
Data Mining 
Mining longitudinal and linked datasets from the web and other archives
Web Data Mining 
Indian Patent Data 
The task at hand was to write code to harvest Indian patent data from the Indian Patent Office website on an ongoing basis 
Code was written to extract information on all patents filed between user-defined dates A and B. 
Further, a data quality tool was written to clean the harvested data. 
A scheduler was written to run the harvesting code followed by the data quality code every week (the Indian Patent Office publishes patents every week). 
This master scheduler runs every week and updates all fields of newly published patents in the master database.
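To make the pipeline concrete, here is a minimal Python sketch of such a weekly harvest-then-clean-then-update loop. The function names (harvest_patents, clean_patents, update_master_pool) are hypothetical placeholders, not the production code.

```python
"""Weekly scheduler sketch: harvest new patent records, clean them, update the master pool.

The functions below are hypothetical stand-ins for the harvesting and
data-quality code described above.
"""
import datetime as dt
import time


def harvest_patents(start: dt.date, end: dt.date) -> list[dict]:
    # Placeholder: download records for patents published between start and end.
    print(f"Harvesting patents published {start} .. {end}")
    return []


def clean_patents(records: list[dict]) -> list[dict]:
    # Placeholder: apply data-quality rules (deduplication, field normalization).
    return [r for r in records if r.get("application_number")]


def update_master_pool(records: list[dict]) -> None:
    # Placeholder: upsert cleaned records into the master database.
    print(f"Updating master pool with {len(records)} records")


def run_weekly() -> None:
    """Run harvest -> clean -> update once a week, covering the previous week."""
    while True:
        end = dt.date.today()
        start = end - dt.timedelta(days=7)
        update_master_pool(clean_patents(harvest_patents(start, end)))
        time.sleep(7 * 24 * 3600)  # sleep one week before the next run


if __name__ == "__main__":
    run_weekly()
```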
The task at hand was to write code to harvest the latest earnings conference calls of all NASDAQ-listed firms and parse each call at the individual speaker level 
Earnings conference calls were harvested from NASDAQ, Thomson StreetEvents and other sources on a daily basis 
Automation code was written to parse each conference call at the individual speaker level, capturing metadata such as speaker name, designation, company and speech text 
A scheduler was written to run the harvesting and parsing code every day (harvesting from NASDAQ) 
This master scheduler runs every day and updates all fields of new conference calls in the master database. 
Web Data Mining 
Earnings Conference Calls
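A minimal sketch of speaker-level parsing, assuming each speaking turn in a transcript begins with a header line like "John Doe - Chief Executive Officer, Example Corp"; the regex and field names are illustrative rather than the production parser.

```python
"""Sketch: split an earnings-call transcript into speaker-level records."""
import re
from dataclasses import dataclass

# Illustrative header format: "Name - Designation, Company"
SPEAKER_HEADER = re.compile(r"^(?P<name>[A-Z][\w.' -]+) - (?P<designation>[^,]+), (?P<company>.+)$")


@dataclass
class Turn:
    name: str
    designation: str
    company: str
    text: str


def parse_transcript(raw: str) -> list[Turn]:
    """Collect one record per speaking turn, with speaker metadata and speech text."""
    turns: list[Turn] = []
    current = None
    for line in raw.splitlines():
        match = SPEAKER_HEADER.match(line.strip())
        if match:
            if current:
                turns.append(current)
            current = Turn(text="", **match.groupdict())
        elif current:
            current.text += line.strip() + " "
    if current:
        turns.append(current)
    return turns
```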
The task at hand was to create a longitudinal data-set containing details of CEO compensation 
DEF 14A proxy filings on SEC EDGAR contain details of how, and how much, each company's CEO was compensated 
Automation code was written to parse those proxy filings and pull out total compensation as well as its break-up, including base salary, bonus, EPS, etc. 
A team of financial domain analysts manually studied the proxy filings and created dummy variables for the various compensation strategies used in practice for CEOs and key executives 
Finally, the data-set was cleaned, standardized and sent to the quality team for vetting. 
Archive Data Mining 
CEO Compensation
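A minimal sketch of the extraction step, assuming the Summary Compensation Table has already been pulled out of the filing HTML (for example with pandas.read_html); the "Year" column, keyword lists and field names are illustrative, since actual column labels vary by filer.

```python
"""Sketch: extract a compensation break-up from a DEF 14A Summary Compensation Table."""
import re

import pandas as pd

# Illustrative keyword lists; real column headings differ between filers.
FIELD_KEYWORDS = {
    "salary": ["salary"],
    "bonus": ["bonus"],
    "stock_awards": ["stock awards"],
    "total": ["total"],
}


def to_number(value) -> float:
    """Turn strings like '$1,234,567' into floats."""
    digits = re.sub(r"[^\d.]", "", str(value))
    return float(digits) if digits else 0.0


def extract_compensation(table: pd.DataFrame, year: int) -> dict:
    """Return a {field: amount} dict for the row of the given fiscal year.

    Assumes the table has a 'Year' column and one row per executive-year.
    """
    row = table[table["Year"].astype(str).str.contains(str(year))].iloc[0]
    result = {}
    for field, keywords in FIELD_KEYWORDS.items():
        for col in table.columns:
            if any(k in str(col).lower() for k in keywords):
                result[field] = to_number(row[col])
    return result
```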
Statistical Modeling 
Curating statistical models or running statistical analysis on real or simulated data
The task at hand was to prepare and implement statistical models to optimize school transportation and logistics costs for real cities, based on real data 
A professor was appointed by an NPO and a government to prepare statistical models aimed at city planning for optimizing school costs 
Our statisticians used clustering, optimization and multi-processing techniques to minimize the total transportation and operational cost of schools for a city. 
Various constraints and difficulties were handled, such as bus capacity, total riding time, mixed loading, land and sea routes, different bus types and classroom sizes, and many more. 
A flexible graphical user interface was delivered to the client to view output on maps and spreadsheets, manually change routes, upload input data, etc. 
Statistical Modeling - Optimization 
Logistics Optimization
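A minimal sketch of the clustering step with scikit-learn, assuming pickup stops are given as coordinates with a student count per stop and that no single stop exceeds bus capacity; the capacity-splitting heuristic is illustrative and omits the riding-time, mixed-loading and route constraints handled in the real model.

```python
"""Sketch: group student pickup stops into capacity-feasible bus clusters."""
import numpy as np
from sklearn.cluster import KMeans


def cluster_stops(coords: np.ndarray, demand: np.ndarray, bus_capacity: int) -> list[np.ndarray]:
    """Return one index array per bus, each within the capacity limit."""
    n_buses = max(1, int(np.ceil(demand.sum() / bus_capacity)))
    labels = KMeans(n_clusters=n_buses, n_init=10, random_state=0).fit_predict(coords)
    clusters: list[np.ndarray] = []
    for k in range(n_buses):
        chunk: list[int] = []
        for i in np.where(labels == k)[0]:
            # Start a new bus when adding this stop would exceed capacity
            # (assumes each individual stop fits on one bus).
            if chunk and demand[chunk].sum() + demand[i] > bus_capacity:
                clusters.append(np.array(chunk))
                chunk = []
            chunk.append(int(i))
        if chunk:
            clusters.append(np.array(chunk))
    return clusters


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 10, size=(40, 2))   # 40 pickup stops in a 10x10 km area
    demand = rng.integers(1, 6, size=40)        # students waiting at each stop
    for bus, stops in enumerate(cluster_stops(coords, demand, bus_capacity=30)):
        print(f"bus {bus}: stops {stops.tolist()}, load {int(demand[stops].sum())}")
```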
Using hidden Markov models to uncover hidden states of consumer behavior 
Starting from the hypothesis that consumers pass through sequential hidden phases of behavior towards a product ecosystem over time, our task was to validate that such hidden states exist and to understand their transitions over time. 
Using panel data on consumers' adoption and rejection of various kinds of products in a product ecosystem, covering more than 10,000 products and 100,000 consumers, we prepared a hidden Markov model with a maximum of 8 states and 20 time periods. 
Computation time for this analysis grew exponentially, and each optimization iteration had to store gigabytes of data. A Hadoop layer on R, using HDFS across five machines, was therefore employed to reduce the time and memory load on any single machine. 
Statistical Modeling 
Hidden Markov Model
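A minimal numpy sketch of the forward recursion at the core of such a hidden Markov model, evaluating the likelihood of one consumer's adopt/reject sequence; the toy parameters are illustrative, and the production model (up to 8 states over 20 time periods) was estimated in R over an HDFS cluster as described above.

```python
"""Sketch: forward-algorithm likelihood for a discrete hidden Markov model."""
import numpy as np


def forward_loglik(obs: np.ndarray, start: np.ndarray, trans: np.ndarray, emit: np.ndarray) -> float:
    """Log-likelihood of an observation sequence under an HMM.

    obs:   integer observation symbols, shape (T,)
    start: initial state probabilities, shape (S,)
    trans: state transition matrix, shape (S, S)
    emit:  emission probabilities, shape (S, num_symbols)
    """
    alpha = start * emit[:, obs[0]]
    loglik = 0.0
    for t in range(1, len(obs)):
        # Scale at each step to avoid numerical underflow over long sequences.
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = (alpha / norm) @ trans * emit[:, obs[t]]
    loglik += np.log(alpha.sum())
    return loglik


if __name__ == "__main__":
    start = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.2, 0.8]])
    emit = np.array([[0.9, 0.1], [0.3, 0.7]])  # 2 states x 2 symbols (adopt / reject)
    obs = np.array([0, 0, 1, 1, 1])
    print(forward_loglik(obs, start, trans, emit))
```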
Data Analysis – Predictive Analysis 
Predicting Hospital Readmission Rates 
Predicting and benchmarking readmission rates using readmission data, performance metrics and causal factors. 
With the implementation of the Affordable Care Act (Obamacare), hospitals now have a pressing need to significantly reduce their readmission rates or pay heavy penalties. A comprehensive solution requires collating all readmission data available from various sources and then identifying the causal factors affecting readmission. 
AIC and BIC analysis was carried out for each disease to drop variables with insignificant dependence on readmission rates, leaving the key variables for each disease. Multiple linear regression was then run with readmission rate as the response variable and the remaining variables as explanatory variables, and the results matched intuitive expectations. 
We further developed a BI tool built on this model to predict readmission rates and their dependence on various KPIs.
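A minimal statsmodels sketch of this screening-then-regression step: backward-drop predictors while doing so lowers AIC (or BIC), then fit the final multiple linear regression; the file and column names such as readmission_rate are hypothetical.

```python
"""Sketch: AIC/BIC backward screening followed by multiple linear regression."""
import pandas as pd
import statsmodels.api as sm


def backward_select(df: pd.DataFrame, response: str, criterion: str = "aic") -> list[str]:
    """Drop predictors one at a time while doing so improves AIC (or BIC)."""
    predictors = [c for c in df.columns if c != response]
    y = df[response]

    def score(cols: list[str]) -> float:
        fit = sm.OLS(y, sm.add_constant(df[cols])).fit()
        return getattr(fit, criterion)

    current = score(predictors)
    improved = True
    while improved and len(predictors) > 1:
        improved = False
        for col in list(predictors):
            trial = [c for c in predictors if c != col]
            if score(trial) < current:
                predictors, current, improved = trial, score(trial), True
                break
    return predictors


if __name__ == "__main__":
    df = pd.read_csv("readmissions_by_disease.csv")  # hypothetical input file
    keep = backward_select(df, response="readmission_rate")
    final = sm.OLS(df["readmission_rate"], sm.add_constant(df[keep])).fit()
    print(final.summary())
```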
Big Data Analysis 
Implementing distributed computing methods and cutting-edge statistical models to analyze very large volumes of data
Data Analysis – Contextual Analysis 
Sentiment Analysis 
Navigating through millions of tweets to understand people's sentiment 
Millions of tweets for various people were extracted using data harvesting techniques. 
Using Naïve Bayes and a training set of 100,000 tweets, a supervised classifier was built to decide the sentiment of any tweet. This required first removing stop words and substituting elongated chat spellings of words with their standard English counterparts. 
Sentiments of people were analyzed in various control groups. 
It would not have been possible to complete this project in a record two-month timeframe without the use of big data technologies.
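A minimal scikit-learn sketch of the classifier step: clean tweets (collapse elongated chat spellings, strip URLs and mentions, remove stop words) and train a Naïve Bayes model; the training file and its columns are hypothetical.

```python
"""Sketch: Naive Bayes tweet sentiment classifier with simple text cleaning."""
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def clean(tweet: str) -> str:
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet.lower())  # "soooo" -> "soo"
    return re.sub(r"https?://\S+|@\w+", " ", tweet)        # strip URLs and mentions


if __name__ == "__main__":
    train = pd.read_csv("labeled_tweets.csv")  # hypothetical: columns 'text', 'sentiment'
    model = make_pipeline(
        CountVectorizer(preprocessor=clean, stop_words="english"),
        MultinomialNB(),
    )
    model.fit(train["text"], train["sentiment"])
    print(model.predict(["This phone is soooo good!"]))
```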
Data Analysis – Big Data Analysis 
Analyzing high-frequency trading data 
Querying high-frequency limit order book data for all NYSE-traded securities 
The project involved analyzing high frequency datasets obtained from NYSE Open Book and Open Book Ultra. 
One of the biggest challenges in analyzing this data was its scale: the total size was 30 TB and growing, and even simple analyses involved database queries that took several hours at a time. 
We distributed the computation using MapReduce-based algorithms implemented in Python, which enabled near real-time querying. 
This setup was later used to analyze price variation and impact on market liquidity, and the distributed computing approach saved a great deal of analysis time.
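A minimal sketch of the map/reduce pattern in plain Python multiprocessing: each worker computes per-symbol volume for one shard of order-book records, and the partial results are merged; the shard paths and CSV layout are hypothetical, and the production system ran on a proper distributed cluster rather than one machine.

```python
"""Sketch: a map/reduce-style aggregation over sharded order-book files."""
from collections import Counter
from multiprocessing import Pool
import csv
import glob


def map_shard(path: str) -> Counter:
    """Map step: per-symbol share volume within one shard."""
    volume: Counter = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            volume[row["symbol"]] += int(row["shares"])
    return volume


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the per-shard counters."""
    total: Counter = Counter()
    for part in partials:
        total.update(part)
    return total


if __name__ == "__main__":
    shards = glob.glob("openbook_shards/*.csv")  # hypothetical shard location
    with Pool() as pool:
        totals = reduce_counts(pool.map(map_shard, shards))
    print(totals.most_common(10))
```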
Data Analysis – Prescriptive Analysis 
Personality Analysis 
The personality of key executives of various companies was examined using text transcripts of their public speeches 
Using our parsed database of speeches of key executives of NASDAQ-traded firms, we performed personality analysis of over 30,000 people (including chief executives and analysts) and 10 million words, using psycholinguistic word databases and SVM techniques. 
Each person was given a score from 1 to 8 on each of the Big Five personality traits. 
These traits were correlated with the following quarters' conference calls, and patterns were observed between chief executives' personalities and their effect on financial (especially stock) performance in the following quarter.
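A minimal scikit-learn sketch of the scoring step, assuming each speaker's speeches have already been reduced to psycholinguistic category frequencies and that a labeled sample with a 1-to-8 trait score exists; the files, columns and single-trait setup are hypothetical.

```python
"""Sketch: trait scoring from psycholinguistic category counts with an SVM."""
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

if __name__ == "__main__":
    labeled = pd.read_csv("trait_training.csv")     # hypothetical labeled sample
    unscored = pd.read_csv("speaker_features.csv")  # hypothetical full speaker panel

    features = [c for c in labeled.columns if c != "extraversion_score"]
    model = make_pipeline(StandardScaler(), SVC())  # classify into 1..8 score bands
    model.fit(labeled[features], labeled["extraversion_score"])

    unscored["extraversion_score"] = model.predict(unscored[features])
    unscored.to_csv("scored_speakers.csv", index=False)
```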
Data Standardization 
Converting raw data into cleaned and standardized formats
Merging company names in financial and patent data-sets 
The task was to merge a US financial data-set with 15,000 unique standardized company names and a US patent data-set with 800,000+ unstandardized unique company names. 
After thoroughly fine-tuning our hybrid model for company name matching, we reduced those 800,000+ unstandardized names to 200,000+ standardized names and mapped all 15,000 company names in the finance data-set to the patent data-set. 
The results were spectacular; notably, we found 105 different misspellings of IBM, which we grouped under one standardized name. 
Data Standardization 
Merging Patent and Finance Data
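A minimal sketch of one common hybrid for grouping company-name variants: normalize case, punctuation and legal suffixes, then fall back to fuzzy string similarity; the suffix list and threshold are illustrative, not the tuned production model.

```python
"""Sketch: grouping unstandardized company names by normalization + fuzzy match."""
import re
from difflib import SequenceMatcher

# Illustrative legal-suffix list; the real model used a far richer set of rules.
SUFFIXES = r"\b(incorporated|inc|corporation|corp|company|co|ltd|llc)\b"


def normalize(name: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    name = re.sub(SUFFIXES, " ", name)
    return " ".join(name.split())


def assign_standard_name(name: str, standards: list[str], threshold: float = 0.9) -> str:
    """Map a raw name to an existing standard name, or create a new one."""
    key = normalize(name)
    for std in standards:
        if SequenceMatcher(None, key, std).ratio() >= threshold:
            return std
    standards.append(key)
    return key


if __name__ == "__main__":
    standards: list[str] = []
    raw = ["IBM Corp.", "IBM Corporation", "IBM Inc", "Acme Co"]
    print({r: assign_standard_name(r, standards) for r in raw})
```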
Standardizing names of plaintiffs and defendants across a litigation data-set containing approximately 4.7 million unique names 
Used benchmark algorithms based on Levenshtein distance and N-grams to standardize the names, implemented in Python 
Prepared and optimized hybrid parameterized models combining the above algorithms 
Implemented the above on a five-node cluster using Hadoop Distributed File System technology 
Data Standardization 
Standardizing Names of individuals
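A minimal pure-Python sketch of the two similarity measures named above, blended into a single score; the equal weighting and the trigram size are illustrative stand-ins for the tuned hybrid parameters.

```python
"""Sketch: Levenshtein and character-N-gram similarity, blended into one score."""


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams."""
    def grams(s: str) -> set:
        return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def name_similarity(a: str, b: str) -> float:
    """Blend edit-distance and n-gram similarity (illustrative 50/50 weighting)."""
    lev = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    return 0.5 * lev + 0.5 * ngram_similarity(a, b)


if __name__ == "__main__":
    print(name_similarity("jonathan smith", "johnathan smyth"))
```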
Technology Implementation 
Implementing customized mobile and software tools to make research faster, scalable and economical
Developing a mobile and cloud-based survey application for researchers to conduct, analyze and report surveys 
Technology Implementation 
Developing Mobile Applications 
Capturing survey data, digitizing results and performing analyses used to be a long-drawn, error-prone manual process. 
We worked with researchers at the university to develop an Android application for collecting data on the smartphones of on-the-ground surveyors. 
The result: automated skip logic, no manual digitization, audio and video responses, an analysis dashboard, reliable GPS locations, all at the click of a button.
Building a platform to assist in collecting, tracking and analyzing large data-sets 
Technology Implementation 
Developing Desktop Applications 
Developed a login-based, offline-capable .NET application that collects survey data, syncs local data with the cloud and lets users analyze data on a continual basis, backed by a reliable and scalable cloud database, Google Cloud SQL 
Considering physical constraints such as electricity and internet outages, our team built the technology so it can be installed on any remote terminal where data collection activities take place 
All local copies of the data are kept on the local machines until an internet connection becomes available, at which point the data syncs to a cloud database that the client's core team members can access from anywhere, at any time
Programmed a mail engine to send and monitor distributed email campaigns 
Technology Implementation 
Developing Email Delivery and Analytics Engine 
One of the major challenges faced by the researcher was the lack of a comprehensive solution for disseminating information to a defined target segment. Conventional programs were cluttered, repeatedly hit by delivery failures, and could not provide the key insights. 
The application was hosted on App Engine with scalable infrastructure to enable real-time monitoring and delivery, while the database was centrally managed on Google Cloud SQL 
The application was programmed using distributed loading techniques to send more than 1 million emails per day.
Data Visualization 
Creating customized visualization for your data-sets to tell a compelling story
Data Visualization 
Industry Turbulence
Data Visualization 
Countries in Motion
Data Visualization 
Web Interactive Customizable World Maps
Clients 
Researchers from
Contact 
We like challenges 
Keep throwing questions at us and hear from our engagement managers on how we can help. 
Write to us at info@innovaccer.com 
Or have a look at us at www.innovaccer.com 
Or call us at +91 120 431 1139
