Megaputer Blog

Accurately merging mixed data sources with fuzzy joins

21.05.2018 | Yuri Slynko, Data Mining

In our daily work we often need to combine two or more datasets into one. This operation, known as a join, is straightforward when each record carries a unique ID present in both datasets. However, in many scenarios the datasets construct their keys differently, so the keys do not match, or there are no unique keys at all. In these situations a traditional join does not suffice. For example, many of our projects involve analyzing data about individual people. One dataset may come from a source such as a hospital and contain a person's medical data, while another may come from a source such as an insurance company and contain policy information. It is unlikely that these two institutions share a record-keeping system in which real-world individuals are given the same unique key in both…
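The idea behind a fuzzy join can be sketched with Python's standard-library difflib. The records, field names, and similarity threshold below are invented for illustration and do not reflect PolyAnalyst's actual implementation; the point is simply that records are paired by string similarity rather than by an exact key match.

```python
from difflib import SequenceMatcher

# Hypothetical sample data: the same two people, keyed by slightly
# different spellings of their names in each source.
hospital = [
    {"name": "Jonathan Smith", "diagnosis": "flu"},
    {"name": "Mary O'Connor", "diagnosis": "asthma"},
]
insurance = [
    {"name": "Jon Smith", "policy": "P-1001"},
    {"name": "Marie OConnor", "policy": "P-1002"},
]

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_join(left, right, key, threshold=0.7):
    """Pair each left record with its best-scoring right record,
    keeping the pair only when the score clears the threshold."""
    joined = []
    for lrec in left:
        best = max(right, key=lambda r: similarity(lrec[key], r[key]))
        if similarity(lrec[key], best[key]) >= threshold:
            joined.append({**lrec, **best})
    return joined

for row in fuzzy_join(hospital, insurance, "name"):
    print(row)
```

Real systems typically add blocking (comparing only plausible candidate pairs) so the match step does not scale quadratically with dataset size.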

Read more

Added support for stop list dictionaries when analyzing spelling errors

15.05.2018 | Yuri Slynko, Text Analytics

In previous builds of PolyAnalyst, the only way to keep a word from being flagged by the spell checker was to add it to the morphology dictionary. This sometimes forced users to add entries that did not really belong there, such as product codes (Model ACBXYZ) or the occasional foreign word (e.g. yukata). To prevent this dictionary mismatch, PolyAnalyst now includes stop list functionality for spell checking. In other words, you can define a list of words for the spell checker to ignore without declaring them to be actual English words. Without a stop list the word is flagged; with a stop list, no more "yukata"!
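The distinction between the two lists can be illustrated with a minimal sketch. The dictionary, stop list, and the flag_misspellings helper below are invented for this example and are not PolyAnalyst's API; they only show why stop-listed tokens are skipped without being treated as real English words.

```python
# Hypothetical word lists: a tiny "morphology dictionary" of known
# English words, plus a stop list of tokens to ignore outright.
DICTIONARY = {"the", "new", "model", "ships", "with", "a", "cotton"}
STOP_LIST = {"acbxyz", "yukata"}  # skipped, but never counted as English

def flag_misspellings(text: str, dictionary: set, stop_list: set) -> list:
    """Return tokens that are neither known words nor stop-listed."""
    flagged = []
    for word in text.lower().split():
        token = word.strip(".,!?")
        if token not in dictionary and token not in stop_list:
            flagged.append(token)
    return flagged

text = "The new Model ACBXYZ ships with a coton yukata."
print(flag_misspellings(text, DICTIONARY, STOP_LIST))  # only "coton" is flagged
```

Note that "acbxyz" and "yukata" are ignored by the checker yet never enter the dictionary, so they cannot leak into morphology-based analyses as if they were English words.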

Read more

Manual validation of entities in Entity Extraction

15.05.2018 | Yuri Slynko, Text Analytics

Algorithmic extraction of entities from text is a powerful tool and a core feature of PolyAnalyst, but it can be difficult to get results with absolutely no false positives. Previously, the user would have to reword the algorithm little by little to remove each false positive, or find some other way to filter the results. Fortunately, PolyAnalyst now supports manual validation of entity extraction results. In effect, this means that a user can mark each extraction as invalid, valid, or null, and only those that are not marked invalid make their way into the final dataset. This has a number of advantages beyond not needing to write additional Extended Pattern Definition Language (XPDL) code. For one, it makes it easier for multiple users to collaborate and manipulate the results. Additionally, other users can see what kinds of extractions are being marked as invalid, which may give them a deeper understanding…
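The filtering rule described above is easy to state in code. This is an illustrative sketch only; the extraction records and the status field names are hypothetical, not PolyAnalyst's internal representation.

```python
# Hypothetical extraction results with a user-assigned validation status:
# "valid", "invalid", or None (not yet reviewed).
extractions = [
    {"text": "aspirin",  "status": "valid"},
    {"text": "Tylenol",  "status": None},       # unreviewed, still kept
    {"text": "ASA Corp", "status": "invalid"},  # false positive, dropped
]

# Only extractions not explicitly marked invalid reach the output dataset.
kept = [e for e in extractions if e["status"] != "invalid"]
print([e["text"] for e in kept])
```

Because null-status extractions pass through, reviewers only need to mark the false positives rather than confirm every hit.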

Read more


