- Genetic programming
- Data Mining Using Grammar Based Genetic Programming and Applications
- Shop with confidence
In the study of biological systems the GP has been little applied, however recently several works have applied GP in the study of gene expression [ 54 , 55 ], modeling of algal growth [ 56 ], prediction of cancer [ 57 , 58 ], prediction of medical diagnosis [ 59 ], in the identification and classification of different types of scoliosis [ 60 ] and one area that GP has attracted interest is genome-wide association studies [ 61 ].
Example of a derivation tree [ 46 ]. Example of crossover operators of Grammar Guided GP [ 46 ]. Example of mutation operators of Grammar Guided GP [ 46 ]. Example of domination rank with two objective, the rank 1 is nondominated, rank 2 is only dominated by rank1 and rank 3 is dominated by rank1 an rank 2.
The crowding distance computation requires sorting the population according to each objective function value. Thereafter, for each objective function, the boundary solutions solutions with smallest and largest function values are assigned as an infinite crowding distance value. All other intermediate solutions are assigned a distance value equal to the absolute normalized difference in the function values of two adjacent solutions. This calculation is continued with other objective functions. The overall crowding distance value is calculated as the sum of individual distance values corresponding to each objective.
Each objective function is normalized before calculating the crowding distance. This section describes the details regarding the the computational experiments methodology. Firstly, it explains how the variables were chosen for modeling and then described as the preparation of the data with the formations of the study groups.
Daily calories Kcal mean sd. Most of the children who had missing data were due to refusal to withdraw blood samples, consequently they has missing for all serological data, or failing to provide all stool samples which made them missing all parasitological variables. This made it difficult to apply a methodology for imputation missing data. The use of individuals with missing data in the analyzes would cause different models to present different number of instances, which would compromise their adequate evaluation.
We prefer to exclude all children who had missing data for any of the variables studied were excluded from the study then from the original children, has complete data. For realization of computational experiments the population was divided into groups. This draw was made keeping the frequency of the outcome in the group equal to the frequency of the same in the original population. The first part of the population is defined as validation group and its respective training group consists of the other 4 parts.
The process continue for for each part been validation group and the other ones been their respective training group. At the end of the run 6 times the cross-validation was obtained 30 training groups with subjects, their 30 validation groups with subjects and one group with subjects, respectively. All groups have relative frequencies similar to the original population. The same groups were used in all analyzes. The study population showed more negative individuals for asthma and allergies than positive individuals unbalanced database , and we therefore applied random over-sampling [ 64 ] in each training and validation group in order to prevent the negative group for asthma and allergies from having a greater influence on the accuracy than the positive group.
Random over-sampling technique was not applied to test group. The MGGP was executed in 30 independent times for each training group. An initial population of candidate solutions was randomly generated. The population was evaluated using NSGA based on two objective functions, i minimizing the classification error of the model in the training group ii minimizing the complexity of the model, given by the number of terminals in the tree representation of the candidate solution.
The selection of parental solutions was carried out using tournament: two solutions were randomly selected and the best one of them was chosen to be a parent solution. Then the combination of two parental solutions generate two offspring solutions which suffer crossing and mutation. This process was repeated until offspring solutions were generated and evaluated. The best solutions between parent and offspring solutions were selected to form the next generation. The MGGP executed a total of 20, generations to obtain the final population.
From the population of solutions at the end of the 20, generations, the solutions chosen as best were those that were non-dominated using the error in the validation group instead of the error in the training group to avoid problems with overfitting. The RL models were generated for each of the training groups and then these generated models were evaluated on their respective validation and testing groups.
As we want to compare a regression with classification models, the RL has been converted into a classification model by applying a step function on the predicted value, meaning that if the value predicted by the model is greater than 0. These analyses were performed in Weka V3. Models using the classification algorithm C4. To avoid overfitting, the parameter of minimum number of instances per leaf was set to maximize the mean accuracy of the models for all executions in the validation groups.
These analyses were performed in Weka. J48 is the Java implementation of C4. The RF [ 67 ] algorithm was applied in the 30 training groups. The parameter maximum size of the trees chosen was 3, because this presente the smallest errors in the validation group after the models be generated in the training groups.
ROC space for training groups of RL algorithm, showing the difference of balanced data and unbalanced data. The classification error and the complexity of the set of non-dominated solutions for a training group in the final generation of an MGGP execution. The RL are classification error of RL algorithm for the same training group. Solutions with low complexity are too simple to explain asthma and allergy and consequently have low accuracy. With increasing complexity the misclassification number tends to drop, however very complex models tend to get very specific to the studied sample and lose the ability to explain other databases.
To avoid losing such ability, at the end of execution non-dominated models with respect to the validation group are selected. Despite the best model be the one with the smallest error in the validation group, the solutions with less complexity should not be discarded, as they have the potential to highlight relationships relevant to the understanding of the problem.
Most epidemiological studies use techniques that capture only linear relationships between predictor variables, as for example RL. Although C4. The RF presented accuracy equivalent to MGGP, but the objective is not to predict asthma and allergy, as this would not be expected based only on studied factors. Because these are complex pathologies with multiples still unknown risk factors. The objective of this work is to find relationships between the studied factors that could potentially be related to asthma and allergies.
RF is not useful for that objective because has little capacity to clarify these relations. The first column is odds ratio of all database without test group, second column is odds ratio in the test group, and the third column is accuracy of relation express. Dog at Home! Mother Psychological disorder! The absence of infections such as T. This model indicates that when a person has moderate levels of consumption of fish, fruit, cereals, or is male, and also shows the absence of T. Many biological phenomena do not have a linear behavior. Immune cells like lymphocytes, when stimulated have their response increased.
However, excess stimulation leads to anergy or apoptosis of these cells, thus reducing the response. This kind of behavior is hardly detected properly using RL. In case of male gender or moderate values of feed pattern 1, it is possible to see in this model and others that both the presence of sewage and the absence of T. This model indicates that excess risk factors may lead to a reduction in the chances of being IgE positive. We also performed MGGP runs for each outcome on all individuals without separating by groups. Even knowing that we could not avoid problems of overfiting, we want to observe models that take into account the maximum number of people possible.
The presence of a cat in the house and its association with asthma has presented contradictory results in literature.
Some studies find a positive association with asthma [ 70 , 71 ]. Others found a negative association [ 72 ]. One of the reasons for such disagreements between the works is that the presence of a cat may enhance asthma symptoms, so it is common for parents with asthmatic children to avoid cats, which could cause a negative association in most studies.
The list of the best models generated by MGGP in all individuals is shown in the material supplements. MGGP makes it possible to define rules to deal with variables of different types such as continuous, discrete, categorical, among others. It is also possible to define how and what operations are possible between the different types of variables. MGGP employs rules that restrict the construction of the models, allowing the researcher to add knowledge through the insertion of known relations and the removal of relations that do not make sense.
MGGP allows the researcher to define groups according to some criteria such as economic, environmental and nutritional. This type of constraint allows for the definition of different operations upon members of different groups. In a single MGGP run, solutions with different levels of complexity can be generated, which improve the understanding of intricate relationships among variables in epidemiological studies.
The application of MGGP in a study focused on asthma and allergies has shown that infections, psychosocial, nutritional, hygiene, and socioeconomic factors may be related in intricate ways with these outcomes. This kind of finding could be hardly detected properly using traditional regression based epidemiological techniques. Epstein-Baar virus. Herpes simplex virus. Herpes zoster virus.
Hepatitis A virus. Immunoglobulin E antibody. Immunoglobulin G antibody. Multiobjective grammar-based genetic programming. The authors thank Nicolas Carels for providing the cluster used for computational experiments.
Data Mining Using Grammar Based Genetic Programming and Applications
The Wellcome Trust was not involved in the design of the study, analysis, and interpretation of data, and in writing the manuscript. All authors read and approved this version of the manuscript. Ethical approval was obtained from the Brazilian National Ethical Committee. Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Methodology article Open Access. Multiobjective grammar-based genetic programming applied to the study of asthma and allergy epidemiology. Barreto 1 , 4. BMC Bioinformatics 19 Abstract Background Asthma and allergies prevalence increased in recent decades, being a serious global health problem.
Conclusions MGGP has shown that infections, psychosocial, nutritional, hygiene, and socioeconomic factors may be related in such an intricate way, that could be hardly detected using traditional regression based epidemiological techniques. Genetic programming Asthma Allergy Classifier Multiobjective. Some studies have shown that grammatical genetic programming can be applied to several problems obtaining good results [ 10 — 12 ] Multiobjective optimization problems MOOP are ubiquitous in real-world decision making.
Study population and data collection The study was a post hoc analysis of data collected during a survey of children aged years and living in 24 poor neighborhoods in the city of Salvador, Northeast Brazil, performed in as part of a cohort study to investigate risk factors for asthma and allergy, and is described in detail elsewhere [ 33 ]. Psychological disorder in the mother The SRQ questionnaire was used to assess minor psychiatric disorders in the mother.
Dietary patterns Information about the dietary patterns were obtained based on questionnaire of food frequency, validated by [ 42 ]. Detection of intestinal helminth ova in fecal samples Two fecal samples were collected two days apart and analyzed using the Hoffman sedimentation method and the Kato-Katz thick-smear technique [ 44 ] for the presence of helminth parasites Trichuris trichiura, Ascaris lumbricoides, hookworms and Schistosoma mansoni. Genetic Programming GP Genetic programming GP is a special type of genetic algorithm which creates computational artifacts for instance, computer programs written in a given language to perform a given task.
Grammar guided GP [ 62 ], or grammar-based GP, uses grammars as a way to constrain the representation of the candidate solutions. The candidate programs in Grammar guided GP are represented by derivation trees, in which the internal nodes are the nonterminals of the grammar and the leaf nodes are symbols which appear in the language terminals. An example of a derivation is available in the Fig. Grammar guided GP uses a grammar to guide the allowed representation of the candidate programs.
The use of grammar delimits the creation of the initial population as well as the application of the variational operators as mutation and recombination. For both mutation and recombination, it is only permissible to exchange a non-terminal N for another of the same type, thus maintaining the consistency of the models. The recombination operator is shown in Fig. It is randomly selected a non-terminal that exists in both parents and occurs the exchange of subtrees between parents.
The mutation operator is shown in Fig. An optimization problem seeks to find a solution that maximizes or minimizes an objective. However, many problems require finding the best solutions according to multiple objectives, thus being a multiobjective optimization problems MOOP. The search for the relationships between factors associated with complex diseases such as asthma can be studied as a MOOP, where it is aim to maximize the accuracy and minimize the complexity of the relations simutaneously.
This multiobjective approach aims to find the models that best explain this pathology being as simple as possible and therefore more parsimonious. The Grammar guided GP usually is applied to a mono-objective problem. To create the capability to solve MOOP, instead of using the obtained value of the objective function as criterion for selecting the best solutions in mono-objective problem.
Where one solution dominates the other if this solution is better in relation to all objectives, otherwise the solution is non-dominated. The NSGA uses two criteria for selecting the best solutions based on the objective functions: The dominance rank. All solutions which there is no other solution that is better than it for all objective functions simultaneously is call a nondominated solution. The rank 1 is formed by all nondominated solutions, rank 2 is formed for all solutions that are dominated by only rank1, and so on.
This idea is illustrated in Fig. The exposure variables chosen were those that potentially represent the aspects that may be related directly or indirectly with asthma and allergy. Table 1 Variables used to build Models. To be easier to handle the solution computationally, it was used postfix notation. This population had high prevalence of asthma Such high prevalence has as consequence, the number of positive cases approaching the number of negatives cases, so that an unbalanced problem was not expected.
However, as shown in Fig. Other studies also showed the importance of data balancing in classification algorithms applied to epidemiological problems [ 7 , 68 ].
- Lesson Plans The Adventures of Huckleberry Finn.
- Data Mining Using Grammar Based Genetic Programming and Applications | Man Leung Wong | Springer.
- The Decisions We Make.
All executions of MGGP showed a good range of tradeoff between complexity and error. An execution example is displayed in Fig. This shows that the MGGP was able to find a diverse set of optimal solutions, each with different tradeoff between complexity and accuracy. It is evident that for the set of non-dominated solutions be large, it is not possible to generate low complexity solutions with low misclassification, because that would make this solution dominate the other solutions and reduce the size of the non-dominated set.
The set of solutions obtained by MGGP are non-dominated solutions with respect to the validation group obtained at the last generation. To evaluate these solutions the accuracy in the test group was adopted. The test group is a single group for every 30 runs of the algorithm. So it used to test the general performance of a given solution in different executions. Although most of the best solutions obtained by MGGP showed complexity lower than 50 terminals, a few complex solutions with good accuracy and generalization were found.
Each MGGP run took an average of The current version does not have parallelism capability and we expect to have great performance impact when parallelization is implemented in a future release. The average accuracy comparison among RL, C4. With respect to asthma, RF, C4.
For asthma we note that an important feature that appears in many relationships is the low age. Asthma is a heterogeneous condition with different phenotypes and clinical expressions. A common phenotype of asthma is the transient wheezing phenotype that is not commonly associated with a family history of asthma or with atopy.
For this phenotype, the symptoms tend to regress at age years old [ 69 ], and the high prevalence of this phenotype may explain this relation with low age. Some less complex relationships commonly found were: i low age or dog at home are related to asthma, indicating that dog at home is also related with increased asthma, ii cat at home or low age increasing asthma, indicating that cat at home is also related with increased asthma iii suspected mother psychological disorder also show increased chance to be asthmatic.
Some relationships found that affect the chance of being positive for SPT were the presence of infections T. Other important relation found with SPT, was the high consumption of foods rich in frying pattern 3 and predominance of sausages, eggs and red meat pattern 4. This results indicating that those infections, environment, and feeding behavior may influence SPT positivity. The use of MGGP can be a good alternative to the understanding of epidemiological problems mainly in the study of complex diseases.
Among the qualities presented by this technique, we can highlight: MGGP works with classification models and non-linear regression. Acknowledgements The authors thank Nicolas Carels for providing the cluster used for computational experiments.
Ethics approval and consent to participate Ethical approval was obtained from the Brazilian National Ethical Committee. Competing interests The authors declare that they have no competing interests. Phenotypes, risk factors, and mechanisms of adult-onset asthma. Mediat Inflamm.
Alternatives for logistic regression in cross-sectional studies: an empirical comparison of models that directly estimate the prevalence ratio. Logistic regression models.
Shop with confidence
Allergol Immunopathol. Cancer Informat. Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infectio. Rule-based models of the interplay between genetic and environmental factors in childhood allergy. PloS ONE. Google Scholar Koza JR. Cambrige: MIT press; Should you wish to discuss any potential idea for a project further or receive specific information regarding our book proposal requirements, please contact either John Koza or Melissa Fearon.
A short biography is helpful in evaluating a proposal. John R. David E. Goldberg is the consulting editor for the book series on genetic algorithms and evolutionary computation book series from Kluwer Academic Publishers. Koza at Genetic Programming Inc. Koza at Stanford University. Call for Book Proposals for Kluwer Book Series on Genetic Programming The Kluwer book series on genetic programming will cover applications of genetic programming, theoretical foundations of genetic programming, technique extensions, implementation issues, and applications.